Discussion: BufferAlloc: don't take two simultaneous locks


BufferAlloc: don't take two simultaneous locks

From:
Yura Sokolov
Date:
Good day.

I found an optimization opportunity in the Buffer Manager code, in the
BufferAlloc function:
- When a valid buffer is evicted, BufferAlloc acquires two partition
lwlocks: one for the partition the evicted block belongs to and one for
the partition where the new block will be placed.

This doesn't matter when the number of concurrent replacements is small.
But when many concurrent backends are replacing buffers, a complex net
of lock dependencies quickly arises.

This can easily be seen with select-only pgbench at scale 100 and 128MB
shared buffers: scale 100 produces about 1.5GB of tables, which certainly
does not fit into shared buffers. Performance then starts to degrade at
~100 connections. Even with 1GB shared buffers it slowly degrades after
150 connections.

But strictly speaking, there is no need to hold both locks
simultaneously. The buffer is pinned, so other processes cannot select it
for eviction. If its tag is cleared and the buffer is removed from the old
partition, other processes will not find it. Therefore it is safe to
release the old partition lock before acquiring the new partition lock.

If another process concurrently inserts the same new block, the old buffer
is simply returned to the buffer manager's freelist.
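
For illustration, here is a condensed sketch of the reordered flow. It is
only meant to show the lock ordering: I/O, usage-count and buffer-header-lock
handling are omitted or compressed, and `newTag` stands for the tag of the
block being read in. It is not the actual patch text.

    uint32  oldHash = BufTableHashCode(&buf->tag);
    LWLock *oldPartitionLock = BufMappingPartitionLock(oldHash);
    uint32  newHash = BufTableHashCode(&newTag);
    LWLock *newPartitionLock = BufMappingPartitionLock(newHash);
    int     existing_id;

    /* Drop the old mapping under the old partition lock only. */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    BufTableDelete(&buf->tag, oldHash);
    CLEAR_BUFFERTAG(buf->tag);       /* under the buffer header lock in the real code */
    LWLockRelease(oldPartitionLock); /* released before taking the next lock */

    /* Insert the new mapping under the new partition lock only. */
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    existing_id = BufTableInsert(&newTag, newHash, buf->buf_id);
    if (existing_id >= 0)
    {
        /* Somebody else already loaded this block: give our victim back. */
        LWLockRelease(newPartitionLock);
        StrategyFreeBuffer(buf);
        /* ...and continue with the already-existing buffer instead. */
    }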

Additional optimisation: when the old buffer's entry is reused, there is
no need to put its BufferLookupEnt onto dynahash's freelist. That reduces
lock contention a bit more. To accomplish this, FreeListData.nentries is
changed to pg_atomic_u32/pg_atomic_u64 and atomic increment/decrement
is used.
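
For illustration, the counter maintenance could look roughly like this
(the helper names follow what is visible later in this thread; the exact
field type and the MAX_NENTRIES bound are my assumption):

    #define MAX_NENTRIES ((uint32) 1 << 31)    /* assumed sanity bound */

    static inline void
    free_list_increment_nentries(HASHHDR *hctl, int freelist_idx)
    {
        uint32  prev = pg_atomic_fetch_add_u32(&hctl->freeList[freelist_idx].nentries, 1);

        Assert(prev < MAX_NENTRIES);
    }

    static inline void
    free_list_decrement_nentries(HASHHDR *hctl, int freelist_idx)
    {
        uint32  prev = pg_atomic_fetch_sub_u32(&hctl->freeList[freelist_idx].nentries, 1);

        /* unsigned translation of Assert(nentries >= 0): underflow would wrap around */
        Assert(prev - 1 < MAX_NENTRIES);
    }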

Remark: there was a bug in `hash_update_hash_key`: nentries was not kept
in sync if the old and new keys fall into different freelist partitions.
This bug has never been triggered because the single caller of
`hash_update_hash_key` does not move entries between partitions.

Here are some test results.

- pgbench at scale 100 was run with --select-only (since we want to test
the buffer manager alone); it produces a ~1.5GB table. An example
invocation is sketched after this list.
- two shared_buffers values were tested: 128MB and 1GB.
- the second-best result among five runs was taken.

Tests were made on three system configurations:
- a notebook with i7-1165G7 (limited to 2.8GHz to avoid overheating)
- a Xeon X5675 2-socket NUMA system, 6 cores per socket (12 cores/24 threads)
- the same Xeon X5675 restricted to a single socket
  (with numactl -m 0 -N 0)
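
For example, such a run can be driven roughly like this (the run length and
the exact client count here are illustrative only, not the ones used below):

    pgbench -i -s 100 bench
    pgbench --select-only -c 83 -j 83 -T 30 bench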

Results for i7-1165G7:

  conns |     master |    patched |  master 1G | patched 1G 
--------+------------+------------+------------+------------
      1 |      29667 |      29079 |      29425 |      29411 
      2 |      55577 |      55553 |      57974 |      57223 
      3 |      87393 |      87924 |      87246 |      89210 
      5 |     136222 |     136879 |     133775 |     133949 
      7 |     179865 |     176734 |     178297 |     175559 
     17 |     215953 |     214708 |     222908 |     223651 
     27 |     211162 |     213014 |     220506 |     219752 
     53 |     211620 |     218702 |     220906 |     225218 
     83 |     213488 |     221799 |     219075 |     228096 
    107 |     212018 |     222110 |     222502 |     227825 
    139 |     207068 |     220812 |     218191 |     226712 
    163 |     203716 |     220793 |     213498 |     226493 
    191 |     199248 |     217486 |     210994 |     221026 
    211 |     195887 |     217356 |     209601 |     219397 
    239 |     193133 |     215695 |     209023 |     218773 
    271 |     190686 |     213668 |     207181 |     219137 
    307 |     188066 |     214120 |     205392 |     218782 
    353 |     185449 |     213570 |     202120 |     217786 
    397 |     182173 |     212168 |     201285 |     216489 

Results for 1 socket X5675

  conns |     master |    patched |  master 1G | patched 1G 
--------+------------+------------+------------+------------
      1 |      16864 |      16584 |      17419 |      17630 
      2 |      32764 |      32735 |      34593 |      34000 
      3 |      47258 |      46022 |      49570 |      47432 
      5 |      64487 |      64929 |      68369 |      68885 
      7 |      81932 |      82034 |      87543 |      87538 
     17 |     114502 |     114218 |     127347 |     127448 
     27 |     116030 |     115758 |     130003 |     128890 
     53 |     116814 |     117197 |     131142 |     131080 
     83 |     114438 |     116704 |     130198 |     130985 
    107 |     113255 |     116910 |     129932 |     131468 
    139 |     111577 |     116929 |     129012 |     131782 
    163 |     110477 |     116818 |     128628 |     131697 
    191 |     109237 |     116672 |     127833 |     131586 
    211 |     108248 |     116396 |     127474 |     131650 
    239 |     107443 |     116237 |     126731 |     131760 
    271 |     106434 |     115813 |     126009 |     131526 
    307 |     105077 |     115542 |     125279 |     131421 
    353 |     104516 |     115277 |     124491 |     131276 
    397 |     103016 |     114842 |     123624 |     131019 

Results for 2 socket x5675

  conns |     master |    patched |  master 1G | patched 1G 
--------+------------+------------+------------+------------
      1 |      16323 |      16280 |      16959 |      17598 
      2 |      30510 |      31431 |      33763 |      31690 
      3 |      45051 |      45834 |      48896 |      47991 
      5 |      71800 |      73208 |      78077 |      74714 
      7 |      89792 |      89980 |      95986 |      96662 
     17 |     178319 |     177979 |     195566 |     196143 
     27 |     210475 |     205209 |     226966 |     235249 
     53 |     222857 |     220256 |     252673 |     251041 
     83 |     219652 |     219938 |     250309 |     250464 
    107 |     218468 |     219849 |     251312 |     251425 
    139 |     210486 |     217003 |     250029 |     250695 
    163 |     204068 |     218424 |     248234 |     252940 
    191 |     200014 |     218224 |     246622 |     253331 
    211 |     197608 |     218033 |     245331 |     253055 
    239 |     195036 |     218398 |     243306 |     253394 
    271 |     192780 |     217747 |     241406 |     253148 
    307 |     189490 |     217607 |     239246 |     253373 
    353 |     186104 |     216697 |     236952 |     253034 
    397 |     183507 |     216324 |     234764 |     252872 

As can be seen, the patched version degrades much more slowly than master
(or doesn't degrade at all with 1GB shared buffers on the older processor).

PS.

There is room for further improvements:
- the buffer manager's freelist could be partitioned
- dynahash's freelists could be sized/aligned to the CPU cache line
- in fact, there is no need for dynahash at all. It would be better to build
  a custom hash table using BufferDescs as entries; BufferDesc has spare
  space for a next-link and a hash value.

regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments

Re: BufferAlloc: don't take two simultaneous locks

From:
Zhihong Yu
Date:


Hi,
The improvement is impressive.

For BufTableFreeDeleted(), since it has only one caller, maybe that caller can invoke hash_return_to_freelist() directly.

For free_list_decrement_nentries():

+   Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);

Is the assertion necessary? There is a similar assertion in free_list_increment_nentries() which would maintain hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

Cheers

Re: BufferAlloc: don't take two simultaneous locks

From:
Yura Sokolov
Date:
On Fri, 01/10/2021 at 15:46 -0700, Zhihong Yu wrote:
> 
> Hi,
> Improvement is impressive.

Thank you!

> For BufTableFreeDeleted(), since it only has one call, maybe its
> caller can invoke hash_return_to_freelist() directly.

It would be a dirty break of abstraction. Everywhere else we talk to
BufTable, and here it would suddenly be hash_... ugh.

> For free_list_decrement_nentries():
> 
> +   Assert(hctl->freeList[freelist_idx].nentries.value <
> MAX_NENTRIES);
> 
> Is the assertion necessary ? There is similar assertion in
> free_list_increment_nentries() which would maintain hctl-
> >freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

The assertion in free_list_decrement_nentries is absolutely necessary:
it is a direct translation of Assert(nentries >= 0) from signed to
unsigned types. (Since there are no signed atomics in pg, I had to convert
the signed `long nentries` to an unsigned `pg_atomic_uXX nentries`.)

The assertion in free_list_increment_nentries is not necessary. But it
doesn't hurt either - it is just an Assert, which doesn't compile into
production code.


regards

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com




Re: BufferAlloc: don't take two simultaneous locks

From:
Andrey Borodin
Date:

> On 21 Dec 2021, at 10:23, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> <v1-0001-bufmgr-do-not-acquire-two-partition-lo.patch>

Hi Yura!

I've taken a look at the patch. The idea seems reasonable to me: clearing/evicting the old buffer and placing the new
one seem to be different units of work, and there is no need to couple both partition locks together. And the claimed
performance impact is fascinating! Though I didn't verify it yet.

At first glance the API change in BufTable does not seem obvious to me. Is void *oldelem actually a BufferTag * or
maybe a BufferLookupEnt *? What if we would like to use or manipulate oldelem in the future?

And the name BufTableFreeDeleted() confuses me a bit. You know, in C we usually free(), but in C++ we delete [], and
here we do both... Just to be sure.

Thanks!

Best regards, Andrey Borodin.




Re: BufferAlloc: don't take two simultaneous locks

From:
Kyotaro Horiguchi
Date:
At Sat, 22 Jan 2022 12:56:14 +0500, Andrey Borodin <x4mmm@yandex-team.ru> wrote in 
> I've took a look into the patch. The idea seems reasonable to me:
> clearing\evicting old buffer and placing new one seem to be
> different units of work, there is no need to couple both partition
> locks together. And the claimed performance impact is fascinating!
> Though I didn't verify it yet.

The need for holding both locks came, it seems to me, from the fact
that the function was moving a buffer between two pages, so there is a
moment where the buftable holds two entries for one buffer.  It seems to
me this patch moves a victim buffer to the new page via an
"unallocated" state, avoiding the buftable ever holding duplicate
entries for the same buffer.  The outline of the story sounds
reasonable.

> On a first glance API change in BufTable does not seem obvious to
> me. Is void *oldelem actually BufferTag * or maybe BufferLookupEnt
> *? What if we would like to use or manipulate with oldelem in
> future?
> 
> And the name BufTableFreeDeleted() confuses me a bit. You know, in C
> we usually free(), but in C++ we delete [], and here we do
> both... Just to be sure.

Honestly, I don't like the API change at all, as it allows a
dynahash to be in a (even if only tentatively) broken state and bufmgr touches
too many dynahash details.  Couldn't we get a good part of the
benefit without such invasive changes?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From:
Michail Nikolaev
Date:
Hello, Yura.

The test results look promising. But the naming and the dynahash
API change seem a little confusing.

1) I think it is better to split the main part and the atomic nentries
optimization into separate commits.
2) Also, it would be nice to fix the hash_update_hash_key bug :)
3) Do we really need the SIZEOF_LONG check? I think pg_atomic_uint64 is
fine these days.
4) It looks like hash_insert_with_hash_nocheck could potentially break
the hash table. Would it be better to replace it with
hash_search_with_hash_value with a HASH_ATTACH action?
5) In that case, hash_delete_skip_freelist becomes
hash_search_with_hash_value with HASH_DETTACH.
6) And then hash_return_to_freelist -> hash_dispose_dettached_entry?

Another approach is a new version of hash_update_hash_key with
callbacks. That is probably the most "correct" way to keep the hash table
implementation details closed. It should be doable, I think.

Thanks,
Michail.



Re: BufferAlloc: don't take two simultaneous locks

From:
Michail Nikolaev
Date:
Hello, Yura.

One additional point:

> 1332: Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
> 1333: CLEAR_BUFFERTAG(buf->tag);
> 1334: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
> 1335: UnlockBufHdr(buf, buf_state);

I think there is no point in unlocking the buffer here, because it will be
locked again a few moments later (and no one can find it in the meantime). Of
course, it should be unlocked in case of a collision.

BTW, I still think it is better to introduce some kind of
hash_update_hash_key and use it.

It may look like this:

// should be called with oldPartitionLock acquired
// newPartitionLock hold on return
// oldPartitionLock and newPartitionLock are not taken at the same time
// if newKeyPtr is present - existingEntry is removed
bool hash_update_hash_key_or_remove(
          HTAB *hashp,
          void *existingEntry,
          const void *newKeyPtr,
          uint32 newHashValue,
          LWLock *oldPartitionLock,
          LWLock *newPartitionLock
);

Thanks,
Michail.



Re: BufferAlloc: don't take two simultaneous locks

From:
Yura Sokolov
Date:
On Sun, 06/02/2022 at 19:34 +0300, Michail Nikolaev wrote:
> Hello, Yura.
> 
> A one additional moment:
> 
> > 1332: Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
> > 1333: CLEAR_BUFFERTAG(buf->tag);
> > 1334: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
> > 1335: UnlockBufHdr(buf, buf_state);
> 
> I think there is no sense to unlock buffer here because it will be
> locked after a few moments (and no one is able to find it somehow). Of
> course, it should be unlocked in case of collision.

UnlockBufHdr actually writes buf_state. Until it is called, the buffer
is in an intermediate state and it is ... locked.

We have to write the state with BM_TAG_VALID cleared before we
call BufTableDelete and release oldPartitionLock, to maintain
consistency.

Perhaps this could be cheated, and there is no harm in skipping the state
write at this point. But I'm not confident enough to do it.

> 
> BTW, I still think is better to introduce some kind of
> hash_update_hash_key and use it.
> 
> It may look like this:
> 
> // should be called with oldPartitionLock acquired
> // newPartitionLock hold on return
> // oldPartitionLock and newPartitionLock are not taken at the same time
> // if newKeyPtr is present - existingEntry is removed
> bool hash_update_hash_key_or_remove(
>           HTAB *hashp,
>           void *existingEntry,
>           const void *newKeyPtr,
>           uint32 newHashValue,
>           LWLock *oldPartitionLock,
>           LWLock *newPartitionLock
> );

Interesting suggestion, thanks. I'll think about it.
It has the downside of bringing LWLock knowledge into dynahash.c.
But otherwise it looks smart.

---------

regards,
Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

From:
Yura Sokolov
Date:
Hello, all.

I thought about simplifying the patch, and tested a version
without any BufTable and dynahash API changes at all.

It performs surprisingly well. It is just a bit worse
than v1, since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with the
next message. The patch is attached here.

------

regards,
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments

Re: BufferAlloc: don't take two simultaneous locks

From:
Kyotaro Horiguchi
Date:
At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> Hello, all.
> 
> I thought about patch simplification, and tested version
> without BufTable and dynahash api change at all.
> 
> It performs suprisingly well. It is just a bit worse
> than v1 since there is more contention around dynahash's
> freelist, but most of improvement remains.
> 
> I'll finish benchmarking and will attach graphs with
> next message. Patch is attached here.

Thanks for the new patch.  The patch as a whole looks fine to me. But
some comments need to be revised.

(existing comments)
> * To change the association of a valid buffer, we'll need to have
> * exclusive lock on both the old and new mapping partitions.
...
> * Somebody could have pinned or re-dirtied the buffer while we were
> * doing the I/O and making the new hashtable entry.  If so, we can't
> * recycle this buffer; we must undo everything we've done and start
> * over with a new victim buffer.

We no longer take a lock on the new partition and have not yet made the
new hash entry (unless somebody else has) at this point.


+     * Clear out the buffer's tag and flags.  We must do this to ensure that
+     * linear scans of the buffer array don't think the buffer is valid. We

The reason we can clear out the tag is that it is safe to use the victim
buffer at this point. This comment needs to mention that reason.

+     *
+     * Since we are single pinner, there should no be PIN_COUNT_WAITER or
+     * IO_IN_PROGRESS (flags that were not cleared in previous code).
+     */
+    Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);

It seems to be a test for potential bugs in other functions.  As
the comment says, we are sure that no other processes are pinning
the buffer, and the existing code doesn't seem to care about that
condition.  Is it really needed?


+    /*
+     * Try to make a hashtable entry for the buffer under its new tag. This
+     * could fail because while we were writing someone else allocated another

The most significant point of this patch is the reason the victim
buffer is protected from being stolen until it is set up with the new tag. I
think we need an explanation of that protection here.


+     * buffer for the same block we want to read in. Note that we have not yet
+     * removed the hashtable entry for the old tag.

Since we have removed the hash table entry for the old tag at this
point, the comment is now wrong.


+         * the first place.  First, give up the buffer we were planning to use
+         * and put it to free lists.
..
+        StrategyFreeBuffer(buf);

This is one downside of this patch. But it seems to me that the odds
are low that many buffers get freed in a short time by this logic.  By
the way, it would be better if the sentence starting with "First" had a
separate comment section.


(existing comment)
|     * Okay, it's finally safe to rename the buffer.

We don't "rename" the buffer here.  And the safety is already
establishsed at the end of the oldPartitionLock section. So it would
be just something like "Now allocate the victim buffer for the new
tag"?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From:
Simon Riggs
Date:
On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> Good day, Kyotaro Horiguchi and hackers.
>
> On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> > At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > > Hello, all.
> > >
> > > I thought about patch simplification, and tested version
> > > without BufTable and dynahash api change at all.
> > >
> > > It performs suprisingly well. It is just a bit worse
> > > than v1 since there is more contention around dynahash's
> > > freelist, but most of improvement remains.
> > >
> > > I'll finish benchmarking and will attach graphs with
> > > next message. Patch is attached here.
> >
> > Thanks for the new patch.  The patch as a whole looks fine to me. But
> > some comments needs to be revised.
>
> Thank you for review and remarks.

v3 gets the buffer partition locking right, well done, great results!

In v3, the comment at line 1279 still implies we take both locks
together, which is no longer the case.

Dynahash actions are still possible. You now have the BufTableDelete
before the BufTableInsert, which opens up the possibility I discussed
here:
http://postgr.es/m/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
(Apologies for raising a similar topic, I hadn't noticed this thread
before; thanks to Horiguchi-san for pointing this out).

v1 had a horrible API (sorry!) where you returned the entry and then
explicitly re-used it. I think we *should* make changes to dynahash,
but not with the API you proposed.

Proposal for new BufTable API
BufTableReuse() - similar to BufTableDelete() but does NOT put entry
back on freelist, we remember it in a private single item cache in
dynahash
BufTableAssign() - similar to BufTableInsert() but can only be
executed directly after BufTableReuse(), fails with ERROR otherwise.
Takes the entry from single item cache and re-assigns it to new tag

In dynahash we have two new modes that match the above
HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
places entry on the single item cache, avoiding freelist
HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
uses the entry from the single item cache rather than asking the freelist.
This last call can fail if someone else already inserted the tag, in
which case it adds the single item cache entry back onto the freelist.

Notice that the single item cache is not in shared memory, so on abort we
should give it back; we probably need an extra API call for that as
well, to avoid leaking an entry.
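
To make the shape of the proposal concrete, the BufTable side might look
roughly like this. None of these functions or hash actions exist today; it
is only an illustration of the suggested API, modelled on the existing
BufTableDelete()/BufTableInsert():

    void
    BufTableReuse(BufferTag *tagPtr, uint32 hashcode)
    {
        /* like BufTableDelete(), but HASH_REUSE parks the unlinked entry in a
         * backend-private slot inside dynahash instead of the shared freelist */
        hash_search_with_hash_value(SharedBufHash, (void *) tagPtr,
                                    hashcode, HASH_REUSE, NULL);
    }

    int
    BufTableAssign(BufferTag *tagPtr, uint32 hashcode, int buf_id)
    {
        BufferLookupEnt *result;
        bool        found;

        /* like BufTableInsert(), but HASH_ASSIGN consumes the parked entry;
         * on conflict the parked entry goes back to the freelist and the
         * existing buffer's id is returned */
        result = (BufferLookupEnt *)
            hash_search_with_hash_value(SharedBufHash, (void *) tagPtr,
                                        hashcode, HASH_ASSIGN, &found);
        if (found)
            return result->id;
        result->id = buf_id;
        return -1;
    }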

Doing it this way allows us to
* avoid touching freelists altogether in the common path - we know we
are about to reassign the entry, so we do remember it - no contention
from other backends, no borrowing etc..
* avoid sharing the private details outside of the dynahash module
* allow us to use the same technique elsewhere where we have
partitioned hash tables

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

With these changes, I think we will be able to *reduce* the number of
freelists for partitioned dynahash from 32 to maybe 8, as originally
speculated by Robert in 2016:
   https://www.postgresql.org/message-id/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
since the freelists will be much less contended with the above approach

It would be useful to see performance with a higher number of connections, >400.

--
Simon Riggs                http://www.EnterpriseDB.com/



Re: BufferAlloc: don't take two simultaneous locks

From:
Andres Freund
Date:
Hi,

On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:
> From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
> From: Yura Sokolov <y.sokolov@postgrespro.ru>
> Date: Mon, 21 Feb 2022 08:49:03 +0300
> Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.
> 
> Acquiring two partition locks leads to complex dependency chain that hurts
> at high concurrency level.
> 
> There is no need to hold both lock simultaneously. Buffer is pinned so
> other processes could not select it for eviction. If tag is cleared and
> buffer removed from old partition other processes will not find it.
> Therefore it is safe to release old partition lock before acquiring
> new partition lock.

Yes, the current design is pretty nonsensical. It leads to really absurd stuff
like holding the relation extension lock while we write out old buffer
contents etc.



> +     * We have pinned buffer and we are single pinner at the moment so there
> +     * is no other pinners.

Seems redundant.


> We hold buffer header lock and exclusive partition
> +     * lock if tag is valid. Given these statements it is safe to clear tag
> +     * since no other process can inspect it to the moment.
> +     */

Could we share code with InvalidateBuffer here? It's not quite the same code,
but nearly the same.


> +     * The usage_count starts out at 1 so that the buffer can survive one
> +     * clock-sweep pass.
> +     *
> +     * We use direct atomic OR instead of Lock+Unlock since no other backend
> +     * could be interested in the buffer. But StrategyGetBuffer,
> +     * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> +     * compare tag, and UnlockBufHdr does raw write to state. So we have to
> +     * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?


> +     * Note that we write tag unlocked. It is also safe since there is always
> +     * check for BM_VALID when tag is compared.



>       */
>      buf->tag = newTag;
> -    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> -                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> -                   BUF_USAGECOUNT_MASK);
>      if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> -        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> +        new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
>      else
> -        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> -
> -    UnlockBufHdr(buf, buf_state);
> +        new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
>  
> -    if (oldPartitionLock != NULL)
> +    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> +    while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody else has
locked the buffer header at this moment, it'll lead to completely bogus
results, because unlocking overwrites concurrently written contents (of which
there shouldn't be any, but here there are)...

And or'ing contents in also doesn't make sense because we it doesn't work to
actually unset any contents?

Why don't you just use LockBufHdr/UnlockBufHdr?

Greetings,

Andres Freund



Re: BufferAlloc: don't take two simultaneous locks

From:
Kyotaro Horiguchi
Date:
At Fri, 25 Feb 2022 00:04:55 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Why don't you just use LockBufHdr/UnlockBufHdr?

FWIW, v2 looked fine to me in regard to this point.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From:
Yura Sokolov
Date:
Hello, Simon.

On Fri, 25/02/2022 at 04:35 +0000, Simon Riggs wrote:
> On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> > Good day, Kyotaro Horiguchi and hackers.
> > 
> > On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> > > At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > > > Hello, all.
> > > > 
> > > > I thought about patch simplification, and tested version
> > > > without BufTable and dynahash api change at all.
> > > > 
> > > > It performs suprisingly well. It is just a bit worse
> > > > than v1 since there is more contention around dynahash's
> > > > freelist, but most of improvement remains.
> > > > 
> > > > I'll finish benchmarking and will attach graphs with
> > > > next message. Patch is attached here.
> > > 
> > > Thanks for the new patch.  The patch as a whole looks fine to me. But
> > > some comments needs to be revised.
> > 
> > Thank you for review and remarks.
> 
> v3 gets the buffer partition locking right, well done, great results!
> 
> In v3, the comment at line 1279 still implies we take both locks
> together, which is not now the case.
> 
> Dynahash actions are still possible. You now have the BufTableDelete
> before the BufTableInsert, which opens up the possibility I discussed
> here:
> http://postgr.es/m/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
> (Apologies for raising a similar topic, I hadn't noticed this thread
> before; thanks to Horiguchi-san for pointing this out).
> 
> v1 had a horrible API (sorry!) where you returned the entry and then
> explicitly re-used it. I think we *should* make changes to dynahash,
> but not with the API you proposed.
> 
> Proposal for new BufTable API
> BufTableReuse() - similar to BufTableDelete() but does NOT put entry
> back on freelist, we remember it in a private single item cache in
> dynahash
> BufTableAssign() - similar to BufTableInsert() but can only be
> executed directly after BufTableReuse(), fails with ERROR otherwise.
> Takes the entry from single item cache and re-assigns it to new tag
> 
> In dynahash we have two new modes that match the above
> HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
> places entry on the single item cache, avoiding freelist
> HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
> uses the entry from the single item cache, rather than asking freelist
> This last call can fail if someone else already inserted the tag, in
> which case it adds the single item cache entry back onto freelist
> 
> Notice that single item cache is not in shared memory, so on abort we
> should give it back, so we probably need an extra API call for that
> also to avoid leaking an entry.

Why is there a need for this? In what way could a backend be forced to abort
between BufTableReuse and BufTableAssign in this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
something.

> 
> Doing it this way allows us to
> * avoid touching freelists altogether in the common path - we know we
> are about to reassign the entry, so we do remember it - no contention
> from other backends, no borrowing etc..
> * avoid sharing the private details outside of the dynahash module
> * allows us to use the same technique elsewhere that we have
> partitioned hash tables
> 
> This approach is cleaner than v1, but should also perform better
> because there will be a 1:1 relationship between a buffer and its
> dynahash entry, most of the time.

Thank you for the suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it into the v4 patch? Perhaps it could be a separate
commit in the new version.


> 
> With these changes, I think we will be able to *reduce* the number of
> freelists for partitioned dynahash from 32 to maybe 8, as originally
> speculated by Robert in 2016:
>    https://www.postgresql.org/message-id/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
> since the freelists will be much less contended with the above approach
> 
> It would be useful to see performance with a higher number of connections, >400.
> 
> --
> Simon Riggs                http://www.EnterpriseDB.com/

------

regards,
Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

From:
Simon Riggs
Date:
On Fri, 25 Feb 2022 at 09:24, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

> > This approach is cleaner than v1, but should also perform better
> > because there will be a 1:1 relationship between a buffer and its
> > dynahash entry, most of the time.
>
> Thank you for suggestion. Yes, it is much clearer than my initial proposal.
>
> Should I incorporate it to v4 patch? Perhaps, it could be a separate
> commit in new version.

I don't insist that you do that, but since the API changes are a few
hours' work, ISTM better to include them in one patch for combined perf
testing. It would be better to put all the changes in this area into PG15
than to split them across multiple releases.

> Why there is need for this? Which way backend could be forced to abort
> between BufTableReuse and BufTableAssign in this code path? I don't
> see any CHECK_FOR_INTERRUPTS on the way, but may be I'm missing
> something.

Sounds reasonable.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: BufferAlloc: don't take two simultaneous locks

From:
Yura Sokolov
Date:
Hello, Andres

On Fri, 25/02/2022 at 00:04 -0800, Andres Freund wrote:
> Hi,
> 
> On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:
> > From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
> > From: Yura Sokolov <y.sokolov@postgrespro.ru>
> > Date: Mon, 21 Feb 2022 08:49:03 +0300
> > Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.
> > 
> > Acquiring two partition locks leads to complex dependency chain that hurts
> > at high concurrency level.
> > 
> > There is no need to hold both lock simultaneously. Buffer is pinned so
> > other processes could not select it for eviction. If tag is cleared and
> > buffer removed from old partition other processes will not find it.
> > Therefore it is safe to release old partition lock before acquiring
> > new partition lock.
> 
> Yes, the current design is pretty nonsensical. It leads to really absurd stuff
> like holding the relation extension lock while we write out old buffer
> contents etc.
> 
> 
> 
> > +     * We have pinned buffer and we are single pinner at the moment so there
> > +     * is no other pinners.
> 
> Seems redundant.
> 
> 
> > We hold buffer header lock and exclusive partition
> > +     * lock if tag is valid. Given these statements it is safe to clear tag
> > +     * since no other process can inspect it to the moment.
> > +     */
> 
> Could we share code with InvalidateBuffer here? It's not quite the same code,
> but nearly the same.
> 
> 
> > +     * The usage_count starts out at 1 so that the buffer can survive one
> > +     * clock-sweep pass.
> > +     *
> > +     * We use direct atomic OR instead of Lock+Unlock since no other backend
> > +     * could be interested in the buffer. But StrategyGetBuffer,
> > +     * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> > +     * compare tag, and UnlockBufHdr does raw write to state. So we have to
> > +     * spin if we found buffer locked.
> 
> So basically the first half of of the paragraph is wrong, because no, we
> can't?

Logically, there are no backends that could be interested in the buffer.
Physically they do LockBufHdr/UnlockBufHdr just to check they are not interested.

> > +     * Note that we write tag unlocked. It is also safe since there is always
> > +     * check for BM_VALID when tag is compared.
> 
> 
> >       */
> >      buf->tag = newTag;
> > -    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > -                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> > -                   BUF_USAGECOUNT_MASK);
> >      if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> > -        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > +        new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> >      else
> > -        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > -
> > -    UnlockBufHdr(buf, buf_state);
> > +        new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> >  
> > -    if (oldPartitionLock != NULL)
> > +    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> > +    while (unlikely(buf_state & BM_LOCKED))
> 
> I don't think it's safe to atomic in arbitrary bits. If somebody else has
> locked the buffer header in this moment, it'll lead to completely bogus
> results, because unlocking overwrites concurrently written contents (which
> there shouldn't be any, but here there are)...

That is why there is a safety loop for the case where buf->state was locked just
after the first optimistic atomic_fetch_or. 99.999% of the time this loop will not
have any work to do. But in case another backend did lock buf->state, the loop waits
until it releases the lock and retries the atomic_fetch_or.

> And or'ing contents in also doesn't make sense because we it doesn't work to
> actually unset any contents?

Sorry, I didn't understand that sentence :((

> Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
the single write in this version (if the optimistic case succeeds).

But I thought about using Lock+UnlockBufHdr instead of the safety loop:

    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
    if (unlikely(buf_state & BM_LOCKED))
    {
        buf_state = LockBufHdr(buf);
        UnlockBufHdr(buf, buf_state | new_bits);
    }

I agree the code is cleaner this way. Will do in the next version.

-----

regards,
Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

From:
Andres Freund
Date:
Hi,

On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:
> > > +     * The usage_count starts out at 1 so that the buffer can survive one
> > > +     * clock-sweep pass.
> > > +     *
> > > +     * We use direct atomic OR instead of Lock+Unlock since no other backend
> > > +     * could be interested in the buffer. But StrategyGetBuffer,
> > > +     * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> > > +     * compare tag, and UnlockBufHdr does raw write to state. So we have to
> > > +     * spin if we found buffer locked.
> > 
> > So basically the first half of of the paragraph is wrong, because no, we
> > can't?
> 
> Logically, there are no backends that could be interested in the buffer.
> Physically they do LockBufHdr/UnlockBufHdr just to check they are not interested.

Yea, but that's still being interested in the buffer...


> > > +     * Note that we write tag unlocked. It is also safe since there is always
> > > +     * check for BM_VALID when tag is compared.
> > 
> > 
> > >       */
> > >      buf->tag = newTag;
> > > -    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > > -                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> > > -                   BUF_USAGECOUNT_MASK);
> > >      if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> > > -        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > > +        new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > >      else
> > > -        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > > -
> > > -    UnlockBufHdr(buf, buf_state);
> > > +        new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > >  
> > > -    if (oldPartitionLock != NULL)
> > > +    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> > > +    while (unlikely(buf_state & BM_LOCKED))
> > 
> > I don't think it's safe to atomic in arbitrary bits. If somebody else has
> > locked the buffer header in this moment, it'll lead to completely bogus
> > results, because unlocking overwrites concurrently written contents (which
> > there shouldn't be any, but here there are)...
> 
> That is why there is safety loop in the case buf->state were locked just
> after first optimistic atomic_fetch_or. 99.999% times this loop will not
> have a job. But in case other backend did lock buf->state, loop waits
> until it releases lock and retry atomic_fetch_or.

> > And or'ing contents in also doesn't make sense because we it doesn't work to
> > actually unset any contents?
> 
> Sorry, I didn't understand sentence :((


You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
BM_LOCKED. ORing BM_LOCKED is fine:
Either the buffer is not already locked, in which case it just sets the
BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
BM_LOCKED already was set.

But OR'ing in multiple bits is *not* fine, because it'll actually change the
contents of ->state while the buffer header is locked.


> > Why don't you just use LockBufHdr/UnlockBufHdr?
> 
> This pair makes two atomic writes to memory. Two writes are heavier than
> one write in this version (if optimistic case succeed).

UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
unlocked write.
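
For reference, the macro in buf_internals.h is roughly the following (quoted
from memory, so check the tree for the exact wording):

    #define UnlockBufHdr(desc, s)    \
        do {    \
            pg_write_barrier(); \
            pg_atomic_write_u32(&(desc)->state, (s) & (~BM_LOCKED)); \
        } while (0)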

Greetings,

Andres Freund



Re: BufferAlloc: don't take two simultaneous locks

From:
Yura Sokolov
Date:
On Fri, 25/02/2022 at 09:01 -0800, Andres Freund wrote:
> Hi,
> 
> On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:
> > > > +     * The usage_count starts out at 1 so that the buffer can survive one
> > > > +     * clock-sweep pass.
> > > > +     *
> > > > +     * We use direct atomic OR instead of Lock+Unlock since no other backend
> > > > +     * could be interested in the buffer. But StrategyGetBuffer,
> > > > +     * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> > > > +     * compare tag, and UnlockBufHdr does raw write to state. So we have to
> > > > +     * spin if we found buffer locked.
> > > 
> > > So basically the first half of of the paragraph is wrong, because no, we
> > > can't?
> > 
> > Logically, there are no backends that could be interested in the buffer.
> > Physically they do LockBufHdr/UnlockBufHdr just to check they are not interested.
> 
> Yea, but that's still being interested in the buffer...
> 
> 
> > > > +     * Note that we write tag unlocked. It is also safe since there is always
> > > > +     * check for BM_VALID when tag is compared.
> > > >       */
> > > >      buf->tag = newTag;
> > > > -    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > > > -                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> > > > -                   BUF_USAGECOUNT_MASK);
> > > >      if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> > > > -        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > > > +        new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > > >      else
> > > > -        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > > > -
> > > > -    UnlockBufHdr(buf, buf_state);
> > > > +        new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > > >  
> > > > -    if (oldPartitionLock != NULL)
> > > > +    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> > > > +    while (unlikely(buf_state & BM_LOCKED))
> > > 
> > > I don't think it's safe to atomic in arbitrary bits. If somebody else has
> > > locked the buffer header in this moment, it'll lead to completely bogus
> > > results, because unlocking overwrites concurrently written contents (which
> > > there shouldn't be any, but here there are)...
> > 
> > That is why there is safety loop in the case buf->state were locked just
> > after first optimistic atomic_fetch_or. 99.999% times this loop will not
> > have a job. But in case other backend did lock buf->state, loop waits
> > until it releases lock and retry atomic_fetch_or.
> > > And or'ing contents in also doesn't make sense because we it doesn't work to
> > > actually unset any contents?
> > 
> > Sorry, I didn't understand sentence :((
> 
> You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
> BM_LOCKED. ORing BM_LOCKED is fine:
> Either the buffer is not already locked, in which case it just sets the
> BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
> BM_LOCKED already was set.
> 
> But OR'ing in multiple bits is *not* fine, because it'll actually change the
> contents of ->state while the buffer header is locked.

First, both states are valid: before the atomic_or and after.
Second, nothing reads buffer->state contents while the buffer header is locked:
all LockBufHdr users use the result of LockBufHdr. (I just checked that.)

> > > Why don't you just use LockBufHdr/UnlockBufHdr?
> > 
> > This pair makes two atomic writes to memory. Two writes are heavier than
> > one write in this version (if optimistic case succeed).
> 
> UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
> unlocked write.

Write barrier is not free on any platform.

Well, while I don't see a problem with modifying buffer->state, there is a problem
with modifying buffer->tag: I missed that Drop*Buffers doesn't check the BM_TAG_VALID
flag. Therefore I would either have to add this check to those places, or return to
the LockBufHdr+UnlockBufHdr pair.

For patch simplicity I'll return to the Lock+UnlockBufHdr pair. But it has a measurable
impact at low connection counts on many-socket machines.

> 
> Greetings,
> 
> Andres Freund




Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Fri, 25/02/2022 at 09:38 +0000, Simon Riggs wrote:
> On Fri, 25 Feb 2022 at 09:24, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> 
> > > This approach is cleaner than v1, but should also perform better
> > > because there will be a 1:1 relationship between a buffer and its
> > > dynahash entry, most of the time.
> > 
> > Thank you for suggestion. Yes, it is much clearer than my initial proposal.
> > 
> > Should I incorporate it to v4 patch? Perhaps, it could be a separate
> > commit in new version.
> 
> I don't insist that you do that, but since the API changes are a few
> hours work ISTM better to include in one patch for combined perf
> testing. It would be better to put all changes in this area into PG15
> than to split it across multiple releases.
> 
> > Why there is need for this? Which way backend could be forced to abort
> > between BufTableReuse and BufTableAssign in this code path? I don't
> > see any CHECK_FOR_INTERRUPTS on the way, but may be I'm missing
> > something.
> 
> Sounds reasonable.

Ok, here is v4.
It comes as two commits: one for the BufferAlloc locking change and the other
for avoiding dynahash's freelist.

The buffer locking patch is the same as v2 with some comment changes, i.e. it uses
Lock+UnlockBufHdr.

For dynahash, HASH_REUSE and HASH_ASSIGN are added as suggested.
HASH_REUSE stores the deleted element in a per-process static variable.
HASH_ASSIGN uses this element instead of the freelist. If there is no
such stored element, it falls back to HASH_ENTER.
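Roughly, the stash works like this (a simplified sketch of the idea, not the
exact patch code; the struct layout is illustrative):

    /* Sketch only: HTAB and HASHELEMENT are the dynahash types from hsearch.h */
    static struct
    {
        HTAB        *hashp;     /* table the stashed element belongs to */
        HASHELEMENT *element;   /* element detached by HASH_REUSE, or NULL */
    } DynaHashReuse = {NULL, NULL};

    /*
     * HASH_REUSE (in hash_search_with_hash_value): unlink the found element
     * from its bucket, but instead of pushing it onto the freelist remember
     * it in DynaHashReuse.
     *
     * HASH_ASSIGN: if DynaHashReuse holds an element of the same table, take
     * it for the new key (no freelist access, no freelist spinlock);
     * otherwise fall back to the usual HASH_ENTER path.
     */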

I've implemented Robert Haas's suggestion to count elements in freelists
instead of nentries:

> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition.  So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).

https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com

It helps to avoid taking the freelist lock just to update the counters.
I did it by replacing "nentries" with "nfree" and adding "nalloced" to each
freelist. It also makes "hash_update_hash_key" valid for a key that migrates
between partitions.
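For illustration, the table size is then computed roughly like this (a sketch
of the v4 accounting, not the exact patch code):

    /* Sketch: used entries = allocated entries minus free entries. */
    long
    hash_get_num_entries(HTAB *hashp)
    {
        int     i;
        long    sum = 0;
        int     nlists = IS_PARTITIONED(hashp->hctl) ? NUM_FREELISTS : 1;

        for (i = 0; i < nlists; i++)
        {
            sum += hashp->hctl->freeList[i].nalloced;
            sum -= hashp->hctl->freeList[i].nfree;
        }
        return sum;
    }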

I believe there is no need for a separate "nalloced" per freelist; instead a
single such field should be in HASHHDR. Moreover, it seems to me the
`element_alloc` function doesn't need to acquire the freelist partition lock,
since it is called only during initialization of a shared hash table.
Am I right?

I didn't go down this path in v4 for simplicity, but I can put it into v5
if approved.

To be honest, the "reuse" patch gives little improvement, but it is still
measurable at some connection counts.

I tried reducing the number of freelist partitions to 8, but it has a mixed
impact. Most of the time performance is the same, but sometimes a bit lower.
I didn't investigate the reasons. Perhaps they are not related to the buffer
manager.

I didn't introduce the new functions BufTableReuse and BufTableAssign, since
there is a single call to BufTableInsert and two calls to BufTableDelete. So I
reused these functions and just added a "reuse" flag to BufTableDelete (see the
sketch below).
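So the call sites in BufferAlloc end up looking roughly like this (sketch; the
tag/hash variable names are illustrative):

    /* Evicting a valid victim: detach its entry and stash it for reuse. */
    BufTableDelete(&oldTag, oldHash, true);

    /*
     * Inserting the new tag: picks up the stashed entry if one is present,
     * otherwise takes one from the freelist as before.
     */
    buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

    /*
     * Invalidation paths keep the old behaviour and return the entry to the
     * dynahash freelist.
     */
    BufTableDelete(&oldTag, oldHash, false);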

Tests: simple_select on a Xeon 8354H, with 128MB and 1GB shared buffers,
scale 100.

1 socket:
  conns |     master |   patch_v4 |  master 1G | patch_v4 1G 
--------+------------+------------+------------+------------
      1 |      41975 |      41540 |      52898 |      52213 
      2 |      77693 |      77908 |      97571 |      98371 
      3 |     114713 |     115522 |     142709 |     145226 
      5 |     188898 |     187617 |     239322 |     237269 
      7 |     261516 |     260006 |     329119 |     329449 
     17 |     521821 |     519473 |     672390 |     662106 
     27 |     555487 |     555697 |     674630 |     672736 
     53 |     868213 |     896539 |    1190734 |    1202505 
     83 |     868232 |     866029 |    1164997 |    1158719 
    107 |     850477 |     845685 |    1140597 |    1134502 
    139 |     816311 |     816808 |    1101471 |    1091258 
    163 |     794788 |     796517 |    1078445 |    1071568 
    191 |     765934 |     776185 |    1059497 |    1041944 
    211 |     738656 |     777365 |    1083356 |    1046422 
    239 |     713124 |     841337 |    1104629 |    1116668 
    271 |     692138 |     847803 |    1094432 |    1128971 
    307 |     682919 |     849239 |    1086306 |    1127051 
    353 |     679449 |     842125 |    1071482 |    1117471 
    397 |     676217 |     844015 |    1058937 |    1118628 

2 sockets:
  conns |     master |   patch_v4 |  master 1G | patch_v4 1G 
--------+------------+------------+------------+------------
      1 |      44317 |      44034 |      53920 |      53583 
      2 |      81193 |      78621 |      99138 |      97968 
      3 |     120755 |     115648 |     148102 |     147423 
      5 |     190007 |     188943 |     232078 |     231029 
      7 |     258602 |     260649 |     325545 |     318567 
     17 |     551814 |     552914 |     692312 |     697518 
     27 |     787353 |     786573 |    1023509 |    1022891 
     53 |     973880 |    1008534 |    1228274 |    1278194 
     83 |    1108442 |    1269777 |    1596292 |    1648156 
    107 |    1072188 |    1339634 |    1542401 |    1664476 
    139 |    1000446 |    1316372 |    1490757 |    1676127 
    163 |     967378 |    1257445 |    1461468 |    1655574 
    191 |     926010 |    1189591 |    1435317 |    1639313 
    211 |     909919 |    1149905 |    1417437 |    1632764 
    239 |     895944 |    1115681 |    1393530 |    1616329 
    271 |     880545 |    1090208 |    1374878 |    1609544 
    307 |     865560 |    1066798 |    1355164 |    1593769 
    353 |     857591 |    1046426 |    1330069 |    1584006 
    397 |     840374 |    1024711 |    1312257 |    1564872 

--------

regards

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments

Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> Ok, here is v4.

And here is v5.

First, there was a compilation error in an Assert in dynahash.c.
Excuse me for not checking before sending the previous version.

Second, I added a third commit that reduces the HASHHDR allocation
size for non-partitioned dynahash:
- moved freeList to the last position
- alloc and memset only offsetof(HASHHDR, freeList[1]) for
  non-partitioned hash tables.
I didn't benchmark it, but I will be surprised if it
matters much performance-wise.
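For illustration, the allocation becomes something like this (a sketch, assuming
freeList is now the last member of HASHHDR and HASH_PARTITION marks partitioned
tables):

    /* Sketch: allocate space for only freeList[0] when not partitioned. */
    Size        size;

    if (flags & HASH_PARTITION)
        size = sizeof(HASHHDR);                 /* all NUM_FREELISTS lists */
    else
        size = offsetof(HASHHDR, freeList[1]);  /* just freeList[0] */

    hashp->hctl = (HASHHDR *) hashp->alloc(size);
    MemSet(hashp->hctl, 0, size);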

Third, I put all three commits into a single file so as not to
confuse the commitfest application.

 
--------

regards

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments

Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> > Ok, here is v4.
>
> And here is v5.
>
> First, there was compilation error in Assert in dynahash.c .
> Excuse me for not checking before sending previous version.
>
> Second, I add third commit that reduces HASHHDR allocation
> size for non-partitioned dynahash:
> - moved freeList to last position
> - alloc and memset offset(HASHHDR, freeList[1]) for
>   non-partitioned hash tables.
> I didn't benchmarked it, but I will be surprised if it
> matters much in performance sence.
>
> Third, I put all three commits into single file to not
> confuse commitfest application.

Thanks!  I looked into dynahash part.

 struct HASHHDR
 {
-    /*
-     * The freelist can become a point of contention in high-concurrency hash

Why did you move around the freeList?


-    long        nentries;        /* number of entries in associated buckets */
+    long        nfree;            /* number of free entries in the list */
+    long        nalloced;        /* number of entries initially allocated for

Why do we need nfree?  HASH_ASSIGN should do the same thing as
HASH_REMOVE.  Maybe the reason is that the code tries to put the detached
bucket on a different free list, but we could just remember the
freelist_idx for the detached bucket as we do for hashp.  I think that
should largely reduce the footprint of this patch.

-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);

That optimization may help a bit, but isn't it irrelevant to
this patch?

+        case HASH_REUSE:
+            if (currBucket != NULL)
+            {
+                /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
+                Assert(DynaHashReuse.hashp == NULL);
+                Assert(DynaHashReuse.element == NULL);

I think all cases in the switch(action) other than HASH_ASSIGN need
this assertion, and there is no need to check both; maybe checking only
element would be enough.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Thanks!  I looked into dynahash part.
> 
>  struct HASHHDR
>  {
> -    /*
> -     * The freelist can become a point of contention in high-concurrency hash
> 
> Why did you move around the freeList?
> 
> 
> -    long        nentries;        /* number of entries in associated buckets */
> +    long        nfree;            /* number of free entries in the list */
> +    long        nalloced;        /* number of entries initially allocated for
> 
> Why do we need nfree?  HASH_ASSING should do the same thing with
> HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> bucket to different free list, but we can just remember the
> freelist_idx for the detached bucket as we do for hashp.  I think that
> should largely reduce the footprint of this patch.
> 
> -static void hdefault(HTAB *hashp);
> +static void hdefault(HTAB *hashp, bool partitioned);
> 
> That optimization may work even a bit, but it is not irrelevant to
> this patch?
> 
> +        case HASH_REUSE:
> +            if (currBucket != NULL)
> +            {
> +                /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
> +                Assert(DynaHashReuse.hashp == NULL);
> +                Assert(DynaHashReuse.element == NULL);
> 
> I think all cases in the switch(action) other than HASH_ASSIGN needs
> this assertion and no need for checking both, maybe only for element
> would be enough.

While I looked at the buf_table part, I came up with additional comments.

BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
{
        hash_search_with_hash_value(SharedBufHash,
                                    HASH_ASSIGN,
...
BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

BufTableDelete considers both the reuse and !reuse cases but
BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
odd. We should use HASH_ENTER here.  Thus I think it is more
reasonable that HASH_ENTER uses the stashed entry if it exists and
is needed, or returns it to the freelist if it exists but is not needed.

What do you think about this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Thanks!  I looked into dynahash part.

Then I looked into the bufmgr part.  It looks fine to me but I have some
comments on the code comments.

>         * To change the association of a valid buffer, we'll need to have
>         * exclusive lock on both the old and new mapping partitions.
>        if (oldFlags & BM_TAG_VALID)

We don't take lock on the new mapping partition here.


+     * Clear out the buffer's tag and flags.  We must do this to ensure that
+     * linear scans of the buffer array don't think the buffer is valid. We
+     * also reset the usage_count since any recency of use of the old content
+     * is no longer relevant.
+    *
+     * We are single pinner, we hold buffer header lock and exclusive
+     * partition lock (if tag is valid). Given these statements it is safe to
+     * clear tag since no other process can inspect it to the moment.

This comment is a merger of the comments from InvalidateBuffer and
BufferAlloc.  But I think what we need to explain here is why we
invalidate the buffer here despite the fact that we are going to reuse it soon.
And I think we need to state that the old buffer is now safe to use
for the new tag here.  I'm not sure the statement is really correct,
but clearing it out actually looks safer.

> Now it is safe to use victim buffer for new tag.  Invalidate the
> buffer before releasing header lock to ensure that linear scans of
> the buffer array don't think the buffer is valid.  It is safe
> because it is guaranteed that we're the single pinner of the buffer.
> That pin also prevents the buffer from being stolen by others until
> we reuse it or return it to freelist.

So I want to revise the following comment.

-     * Now it is safe to use victim buffer for new tag.
+     * Now reuse victim buffer for new tag.
>     * Make sure BM_PERMANENT is set for buffers that must be written at every
>     * checkpoint.  Unlogged buffers only need to be written at shutdown
>     * checkpoints, except for their "init" forks, which need to be treated
>     * just like permanent relations.
>     *
>     * The usage_count starts out at 1 so that the buffer can survive one
>     * clock-sweep pass.

But if you think the current comment is fine, I don't insist on the
comment changes.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Fri, 11/03/2022 at 15:30 +0900, Kyotaro Horiguchi wrote:
> At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> > > Ok, here is v4.
> > 
> > And here is v5.
> > 
> > First, there was compilation error in Assert in dynahash.c .
> > Excuse me for not checking before sending previous version.
> > 
> > Second, I add third commit that reduces HASHHDR allocation
> > size for non-partitioned dynahash:
> > - moved freeList to last position
> > - alloc and memset offset(HASHHDR, freeList[1]) for
> >   non-partitioned hash tables.
> > I didn't benchmarked it, but I will be surprised if it
> > matters much in performance sence.
> > 
> > Third, I put all three commits into single file to not
> > confuse commitfest application.
> 
> Thanks!  I looked into dynahash part.
> 
>  struct HASHHDR
>  {
> -       /*
> -        * The freelist can become a point of contention in high-concurrency hash
> 
> Why did you move around the freeList?
> 
> 
> -       long            nentries;               /* number of entries in associated buckets */
> +       long            nfree;                  /* number of free entries in the list */
> +       long            nalloced;               /* number of entries initially allocated for
> 
> Why do we need nfree?  HASH_ASSING should do the same thing with
> HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> bucket to different free list, but we can just remember the
> freelist_idx for the detached bucket as we do for hashp.  I think that
> should largely reduce the footprint of this patch.

If we keep nentries, then we need to fix nentries in both the old
"freeList" partition and the new one. That is two freeList[partition]->mutex
lock+unlock pairs.

But the count of free elements doesn't change, so if we change nentries
to nfree, there is no need to fix the freeList[partition]->nfree counters,
and no need to lock+unlock (see the sketch below).
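To illustrate (sketch only; old_idx/new_idx are illustrative):

    HASHHDR    *hctl = hashp->hctl;

    /*
     * With per-partition nentries, moving an entry between freelist
     * partitions needs both partitions' spinlocks just for accounting:
     */
    SpinLockAcquire(&hctl->freeList[old_idx].mutex);
    hctl->freeList[old_idx].nentries--;
    SpinLockRelease(&hctl->freeList[old_idx].mutex);

    SpinLockAcquire(&hctl->freeList[new_idx].mutex);
    hctl->freeList[new_idx].nentries++;
    SpinLockRelease(&hctl->freeList[new_idx].mutex);

    /*
     * With nfree, a REUSE followed by an ASSIGN leaves every
     * freeList[i].nfree unchanged, so neither spinlock has to be taken
     * just for the counters.
     */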

> 
> -static void hdefault(HTAB *hashp);
> +static void hdefault(HTAB *hashp, bool partitioned);
> 
> That optimization may work even a bit, but it is not irrelevant to
> this patch?
> 
> +               case HASH_REUSE:
> +                       if (currBucket != NULL)
> +                       {
> +                               /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
> +                               Assert(DynaHashReuse.hashp == NULL);
> +                               Assert(DynaHashReuse.element == NULL);
> 
> I think all cases in the switch(action) other than HASH_ASSIGN needs
> this assertion and no need for checking both, maybe only for element
> would be enough.

Agree.




Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:
> At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Thanks!  I looked into dynahash part.
> > 
> >  struct HASHHDR
> >  {
> > -     /*
> > -      * The freelist can become a point of contention in high-concurrency hash
> > 
> > Why did you move around the freeList?
> > 
> > 
> > -     long            nentries;               /* number of entries in associated buckets */
> > +     long            nfree;                  /* number of free entries in the list */
> > +     long            nalloced;               /* number of entries initially allocated for
> > 
> > Why do we need nfree?  HASH_ASSING should do the same thing with
> > HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> > bucket to different free list, but we can just remember the
> > freelist_idx for the detached bucket as we do for hashp.  I think that
> > should largely reduce the footprint of this patch.
> > 
> > -static void hdefault(HTAB *hashp);
> > +static void hdefault(HTAB *hashp, bool partitioned);
> > 
> > That optimization may work even a bit, but it is not irrelevant to
> > this patch?

(forgot to answer in previous letter).
Yes, third commit is very optional. But adding `nalloced` to
`FreeListData` increases allocation a lot even for usual
non-shared non-partitioned dynahashes. And this allocation is
quite huge right now for no meaningful reason.

> > 
> > +             case HASH_REUSE:
> > +                     if (currBucket != NULL)
> > +                     {
> > +                             /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
> > +                             Assert(DynaHashReuse.hashp == NULL);
> > +                             Assert(DynaHashReuse.element == NULL);
> > 
> > I think all cases in the switch(action) other than HASH_ASSIGN needs
> > this assertion and no need for checking both, maybe only for element
> > would be enough.
> 
> While I looked buf_table part, I came up with additional comments.
> 
> BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
> {
>                 hash_search_with_hash_value(SharedBufHash,
>                                                                         HASH_ASSIGN,
> ...
> BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> 
> BufTableDelete considers both reuse and !reuse cases but
> BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> odd. We should use HASH_ENTER here.  Thus I think it is more
> reasonable that HASH_ENTRY uses the stashed entry if exists and
> needed, or returns it to freelist if exists but not needed.
> 
> What do you think about this?

Well... I don't like it but I don't mind either.

The code in the HASH_ENTER and HASH_ASSIGN cases differs a lot.
On the other hand, it is probably possible to merge them carefully.
I'll try.

---------

regards

Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Fri, 11/03/2022 at 17:21 +0900, Kyotaro Horiguchi wrote:
> At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > Thanks!  I looked into dynahash part.
> > > 
> > >  struct HASHHDR
> > >  {
> > > -       /*
> > > -        * The freelist can become a point of contention in high-concurrency hash
> > > 
> > > Why did you move around the freeList?

This way it is possible to allocate just the first freelist partition, not all 32 partitions.

> 
> Then I looked into bufmgr part.  It looks fine to me but I have some
> comments on code comments.
> 
> >                * To change the association of a valid buffer, we'll need to have
> >                * exclusive lock on both the old and new mapping partitions.
> >               if (oldFlags & BM_TAG_VALID)
> 
> We don't take lock on the new mapping partition here.

Thx, fixed.

> +        * Clear out the buffer's tag and flags.  We must do this to ensure that
> +        * linear scans of the buffer array don't think the buffer is valid. We
> +        * also reset the usage_count since any recency of use of the old content
> +        * is no longer relevant.
> +    *
> +        * We are single pinner, we hold buffer header lock and exclusive
> +        * partition lock (if tag is valid). Given these statements it is safe to
> +        * clear tag since no other process can inspect it to the moment.
> 
> This comment is a merger of the comments from InvalidateBuffer and
> BufferAlloc.  But I think what we need to explain here is why we
> invalidate the buffer here despite of we are going to reuse it soon.
> And I think we need to state that the old buffer is now safe to use
> for the new tag here.  I'm not sure the statement is really correct
> but clearing-out actually looks like safer.

I've tried to reformulate the comment block.

> 
> > Now it is safe to use victim buffer for new tag.  Invalidate the
> > buffer before releasing header lock to ensure that linear scans of
> > the buffer array don't think the buffer is valid.  It is safe
> > because it is guaranteed that we're the single pinner of the buffer.
> > That pin also prevents the buffer from being stolen by others until
> > we reuse it or return it to freelist.
> 
> So I want to revise the following comment.
> 
> -        * Now it is safe to use victim buffer for new tag.
> +        * Now reuse victim buffer for new tag.
> >        * Make sure BM_PERMANENT is set for buffers that must be written at every
> >        * checkpoint.  Unlogged buffers only need to be written at shutdown
> >        * checkpoints, except for their "init" forks, which need to be treated
> >        * just like permanent relations.
> >        *
> >        * The usage_count starts out at 1 so that the buffer can survive one
> >        * clock-sweep pass.
> 
> But if you think the current commet is fine, I don't insist on the
> comment chagnes.

Used suggestion.

On Fri, 11/03/2022, Yura Sokolov wrote:
> On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:
> > BufTableDelete considers both reuse and !reuse cases but
> > BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> > odd. We should use HASH_ENTER here.  Thus I think it is more
> > reasonable that HASH_ENTRY uses the stashed entry if exists and
> > needed, or returns it to freelist if exists but not needed.
> > 
> > What do you think about this?
> 
> Well... I don't like it but I don't mind either.
> 
> Code in HASH_ENTER and HASH_ASSIGN cases differs much.
> On the other hand, probably it is possible to merge it carefuly.
> I'll try.

I've merged HASH_ASSIGN into HASH_ENTER.

As in the previous letter, the three commits are concatenated into one file
and can be applied with `git am`.

-------

regards

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments

Re: BufferAlloc: don't take two simultaneous locks

From
Zhihong Yu
Date:


On Sun, Mar 13, 2022 at 3:25 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
On Fri, 11/03/2022 at 17:21 +0900, Kyotaro Horiguchi wrote:
> At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > Thanks!  I looked into dynahash part.
> > >
> > >  struct HASHHDR
> > >  {
> > > -       /*
> > > -        * The freelist can become a point of contention in high-concurrency hash
> > >
> > > Why did you move around the freeList?

This way it is possible to allocate just first partition, not all 32 partitions.

>
> Then I looked into bufmgr part.  It looks fine to me but I have some
> comments on code comments.
>
> >                * To change the association of a valid buffer, we'll need to have
> >                * exclusive lock on both the old and new mapping partitions.
> >               if (oldFlags & BM_TAG_VALID)
>
> We don't take lock on the new mapping partition here.

Thx, fixed.

> +        * Clear out the buffer's tag and flags.  We must do this to ensure that
> +        * linear scans of the buffer array don't think the buffer is valid. We
> +        * also reset the usage_count since any recency of use of the old content
> +        * is no longer relevant.
> +    *
> +        * We are single pinner, we hold buffer header lock and exclusive
> +        * partition lock (if tag is valid). Given these statements it is safe to
> +        * clear tag since no other process can inspect it to the moment.
>
> This comment is a merger of the comments from InvalidateBuffer and
> BufferAlloc.  But I think what we need to explain here is why we
> invalidate the buffer here despite of we are going to reuse it soon.
> And I think we need to state that the old buffer is now safe to use
> for the new tag here.  I'm not sure the statement is really correct
> but clearing-out actually looks like safer.

I've tried to reformulate the comment block.

>
> > Now it is safe to use victim buffer for new tag.  Invalidate the
> > buffer before releasing header lock to ensure that linear scans of
> > the buffer array don't think the buffer is valid.  It is safe
> > because it is guaranteed that we're the single pinner of the buffer.
> > That pin also prevents the buffer from being stolen by others until
> > we reuse it or return it to freelist.
>
> So I want to revise the following comment.
>
> -        * Now it is safe to use victim buffer for new tag.
> +        * Now reuse victim buffer for new tag.
> >        * Make sure BM_PERMANENT is set for buffers that must be written at every
> >        * checkpoint.  Unlogged buffers only need to be written at shutdown
> >        * checkpoints, except for their "init" forks, which need to be treated
> >        * just like permanent relations.
> >        *
> >        * The usage_count starts out at 1 so that the buffer can survive one
> >        * clock-sweep pass.
>
> But if you think the current commet is fine, I don't insist on the
> comment chagnes.

Used suggestion.

On Fri, 11/03/2022, Yura Sokolov wrote:
> On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:
> > BufTableDelete considers both reuse and !reuse cases but
> > BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> > odd. We should use HASH_ENTER here.  Thus I think it is more
> > reasonable that HASH_ENTRY uses the stashed entry if exists and
> > needed, or returns it to freelist if exists but not needed.
> >
> > What do you think about this?
>
> Well... I don't like it but I don't mind either.
>
> Code in HASH_ENTER and HASH_ASSIGN cases differs much.
> On the other hand, probably it is possible to merge it carefuly.
> I'll try.

I've merged HASH_ASSIGN into HASH_ENTER.

As in previous letter, three commits are concatted to one file
and could be applied with `git am`.

-------

regards

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Hi,
In the description:

There is no need to hold both lock simultaneously. 

both lock -> both locks

+    * We also reset the usage_count since any recency of use of the old

recency of use -> recent use

+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

Later on, there is code:

+                                   reuse ? HASH_REUSE : HASH_REMOVE,

Can a flag (such as HASH_REUSE) be passed to BufTableDelete() instead of a bool? That way, the flag can be used directly in the above place.

+   long        nalloced;       /* number of entries initially allocated for

nallocated isn't very long. I think it would be better to name the field 'nallocated'.

+           sum += hashp->hctl->freeList[i].nalloced;
+           sum -= hashp->hctl->freeList[i].nfree;

I think it would be better to calculate the difference between nalloced and nfree first, then add the result to sum (to avoid overflow).
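i.e. something like (sketch):

    sum += hashp->hctl->freeList[i].nalloced - hashp->hctl->freeList[i].nfree;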

Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash

memory allocation -> memory allocations

Cheers

Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Sun, 13/03/2022 at 07:05 -0700, Zhihong Yu wrote:
> 
> Hi,
> In the description:
> 
> There is no need to hold both lock simultaneously. 
> 
> both lock -> both locks

Thanks.

> +    * We also reset the usage_count since any recency of use of the old
> 
> recency of use -> recent use

Thanks.

> +BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> 
> Later on, there is code:
> 
> +                                   reuse ? HASH_REUSE : HASH_REMOVE,
> 
> Can flag (such as HASH_REUSE) be passed to BufTableDelete() instead of bool ? That way, flag can be used directly in
> the above place.

No.
The BufTable* functions were created to abstract the Buffer Table from dynahash.
Passing HASH_REUSE directly would break the abstraction.

> +   long        nalloced;       /* number of entries initially allocated for
> 
> nallocated isn't very long. I think it would be better to name the field 'nallocated'.

It is debatable.
Why not num_allocated? allocated_count? number_of_allocations?
The same point applies to nfree.
`nalloced` is recognizable and unambiguous. And there are a lot
of `*alloced` in the PostgreSQL source, so this one will not
be unusual.

I don't see the need to make it longer.

But if someone supports your point, I will not mind changing
the name.

> +           sum += hashp->hctl->freeList[i].nalloced;
> +           sum -= hashp->hctl->freeList[i].nfree;
> 
> I think it would be better to calculate the difference between nalloced and nfree first, then add the result to sum
> (to avoid overflow).

It doesn't really matter much, because the calculation must be valid
even if all nfree == 0.

I'd rather debate the use of 'long' in dynahash at all: 'long' is
32-bit on 64-bit Windows. It would be better to use 'Size' here.

But 'nelements' was 'long', so I didn't change things. I think
that is material for another patch.

(On the other hand, a dynahash with 2**31 elements takes at least
512GB of RAM... we would hardly trigger the problem before the OOM killer
comes. Does Windows have an OOM killer?)

> Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash
> 
> memory allocation -> memory allocations

For each dynahash instance a single allocation was reduced.
I think 'memory allocation' is correct.

The plural would be
    reduce memory allocations for non-partitioned dynahashes
i.e. both 'allocations' and 'dynahashes'.
Am I wrong?


------

regards
Yura Sokolov

Attachments

Re: BufferAlloc: don't take two simultaneous locks

From
Zhihong Yu
Date:


On Sun, Mar 13, 2022 at 3:27 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
On Sun, 13/03/2022 at 07:05 -0700, Zhihong Yu wrote:
>
> Hi,
> In the description:
>
> There is no need to hold both lock simultaneously.
>
> both lock -> both locks

Thanks.

> +    * We also reset the usage_count since any recency of use of the old
>
> recency of use -> recent use

Thanks.

> +BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
>
> Later on, there is code:
>
> +                                   reuse ? HASH_REUSE : HASH_REMOVE,
>
> Can flag (such as HASH_REUSE) be passed to BufTableDelete() instead of bool ? That way, flag can be used directly in the above place.

No.
BufTable* functions are created to abstract Buffer Table from dynahash.
Pass of HASH_REUSE directly will break abstraction.

> +   long        nalloced;       /* number of entries initially allocated for
>
> nallocated isn't very long. I think it would be better to name the field 'nallocated'.

It is debatable.
Why not num_allocated? allocated_count? number_of_allocations?
Same points for nfree.
`nalloced` is recognizable and unambiguous. And there are a lot
of `*alloced` in the postgresql's source, so this one will not
be unusual.

I don't see the need to make it longer.

But if someone supports your point, I will not mind to changing
the name.

> +           sum += hashp->hctl->freeList[i].nalloced;
> +           sum -= hashp->hctl->freeList[i].nfree;
>
> I think it would be better to calculate the difference between nalloced and nfree first, then add the result to sum (to avoid overflow).

Doesn't really matter much, because calculation must be valid
even if all nfree==0.

I'd rather debate use of 'long' in dynahash at all: 'long' is
32bit on 64bit Windows. It is better to use 'Size' here.

But 'nelements' were 'long', so I didn't change things. I think
it is place for another patch.

(On the other hand, dynahash with 2**31 elements is at least
512GB RAM... we doubtfully trigger problem before OOM killer
came. Does Windows have an OOM killer?)

> Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash
>
> memory allocation -> memory allocations

For each dynahash instance single allocation were reduced.
I think, 'memory allocation' is correct.

Plural will be
    reduce memory allocations for non-partitioned dynahashes
ie both 'allocations' and 'dynahashes'.
Am I wrong?

Hi,
bq. reduce memory allocation for non-partitioned dynahash

It seems the following is clearer:

reduce one memory allocation for every non-partitioned dynahash 

Cheers

Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Fri, 11 Mar 2022 11:30:27 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> On Fri, 11/03/2022 at 15:30 +0900, Kyotaro Horiguchi wrote:
> > At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > > On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> > > > Ok, here is v4.
> > >
> > > And here is v5.
> > >
> > > First, there was compilation error in Assert in dynahash.c .
> > > Excuse me for not checking before sending previous version.
> > >
> > > Second, I add third commit that reduces HASHHDR allocation
> > > size for non-partitioned dynahash:
> > > - moved freeList to last position
> > > - alloc and memset offset(HASHHDR, freeList[1]) for
> > >   non-partitioned hash tables.
> > > I didn't benchmarked it, but I will be surprised if it
> > > matters much in performance sence.
> > >
> > > Third, I put all three commits into single file to not
> > > confuse commitfest application.
> >
> > Thanks!  I looked into dynahash part.
> >
> >  struct HASHHDR
> >  {
> > -       /*
> > -        * The freelist can become a point of contention in high-concurrency hash
> >
> > Why did you move around the freeList?
> >
> >
> > -       long            nentries;               /* number of entries in associated buckets */
> > +       long            nfree;                  /* number of free entries in the list */
> > +       long            nalloced;               /* number of entries initially allocated for
> >
> > Why do we need nfree?  HASH_ASSING should do the same thing with
> > HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> > bucket to different free list, but we can just remember the
> > freelist_idx for the detached bucket as we do for hashp.  I think that
> > should largely reduce the footprint of this patch.
>
> If we keep nentries, then we need to fix nentries in both old
> "freeList" partition and new one. It is two freeList[partition]->mutex
> lock+unlock pairs.
>
> But count of free elements doesn't change, so if we change nentries
> to nfree, then no need to fix freeList[partition]->nfree counters,
> no need to lock+unlock.

Ah, okay. I missed that bucket reuse changes the key in most cases.

But I still don't think it's good to move entries around partition
freelists, for another reason.  I'm afraid that the freelists get into
an imbalanced state.  get_hash_entry prefers fresh shmem allocation over
other partitions' freelists, so that could lead to freelist bloat, or worse
contention than the traditional way involving more than two partitions.

I'll examine the possibility to resolve this...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Fri, 11 Mar 2022 12:34:32 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:
> > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> >
> > BufTableDelete considers both reuse and !reuse cases but
> > BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> > odd. We should use HASH_ENTER here.  Thus I think it is more
> > reasonable that HASH_ENTRY uses the stashed entry if exists and
> > needed, or returns it to freelist if exists but not needed.
> >
> > What do you think about this?
>
> Well... I don't like it but I don't mind either.
>
> Code in HASH_ENTER and HASH_ASSIGN cases differs much.
> On the other hand, probably it is possible to merge it carefuly.
> I'll try.

Honestly, I'm not sure it wins on a performance basis. It just came from
interface consistency (mmm, a bit different, maybe... convincingness?).

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Mon, 14 Mar 2022 09:39:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> I'll examine the possibility to resolve this...

The existence of nfree and nalloced confused me, and I found the
reason.

In the case where a partition collects many REUSE-ASSIGN-REMOVEd
elements from other partitions, nfree gets larger than nalloced.  This
is a strange property of the two counters.  nalloced is only referred to
as sum(nalloced[]).  So we don't need nalloced on a per-partition basis,
and the formula to calculate the number of used elements would be as
follows.

 sum(nalloced - nfree)
 = <total_nalloced> - sum(nfree)

We rarely create fresh elements in shared hashes so I don't think
there's additional contention on the <total_nalloced> even if it were
a global atomic.
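As a sketch (not the actual code; the counter name is illustrative), that
would look something like:

    /*
     * Sketch: HASHHDR gains a single pg_atomic_uint64 total_nalloced
     * (name illustrative) instead of a per-freelist nalloced.
     */
    long
    hash_get_num_entries(HTAB *hashp)   /* sketch, not the actual function */
    {
        long    sum = (long) pg_atomic_read_u64(&hashp->hctl->total_nalloced);
        int     nlists = IS_PARTITIONED(hashp->hctl) ? NUM_FREELISTS : 1;
        int     i;

        for (i = 0; i < nlists; i++)
            sum -= hashp->hctl->freeList[i].nfree;
        return sum;
    }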

So, the remaining issue is the possible imbalance among
partitions.  On second thought, in the current way, if there's a bad
deviation in partition usage, a heavily hit partition eventually collects
elements via get_hash_entry().  In the patch's way, a similar thing
happens via the REUSE-ASSIGN-REMOVE sequence. But buffers once used
for something won't be freed until buffer invalidation, and bulk
buffer invalidation won't distribute freed buffers unevenly among
partitions.  So I conclude for now that this is a non-issue.

So my opinion on the counters is:

I'd like to ask you to remove nalloced from partitions then add a
global atomic for the same use?

No need to do something for the possible deviation issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> At Mon, 14 Mar 2022 09:39:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > I'll examine the possibility to resolve this...
> 
> The existence of nfree and nalloc made me confused and I found the
> reason.
> 
> In the case where a parittion collects many REUSE-ASSIGN-REMOVEed
> elemetns from other paritiotns, nfree gets larger than nalloced.  This
> is a strange point of the two counters.  nalloced is only referred to
> as (sum(nalloced[])).  So we don't need nalloced per-partition basis
> and the formula to calculate the number of used elements would be as
> follows.
> 
>  sum(nalloced - nfree)
>  = <total_nalloced> - sum(nfree)
> 
> We rarely create fresh elements in shared hashes so I don't think
> there's additional contention on the <total_nalloced> even if it were
> a global atomic.
> 
> So, the remaining issue is the possible imbalancement among
> partitions.  On second thought, by the current way, if there's a bad
> deviation in partition-usage, a heavily hit partition finally collects
> elements via get_hash_entry().  By the patch's way, similar thing
> happens via the REUSE-ASSIGN-REMOVE sequence. But buffers once used
> for something won't be freed until buffer invalidation. But bulk
> buffer invalidation won't deviatedly distribute freed buffers among
> partitions.  So I conclude for now that is a non-issue.
> 
> So my opinion on the counters is:
> 
> I'd like to ask you to remove nalloced from partitions then add a
> global atomic for the same use?

I really believe it should be global. I made it per-partition so as
not to overcomplicate the first versions. Glad you brought it up.

I thought of protecting it with freeList[0].mutex, but probably an atomic
is a better idea here. But which atomic to choose: uint64 or uint32?
Based on sizeof(long)?
Ok, I'll do it in the next version.

The whole of get_hash_entry looks strange.
Wouldn't it be better to cycle through the partitions' freelists first and
only then fall back to element_alloc?
Maybe there should be a bitmap of non-empty free lists? 32 bits for
32 partitions. But wouldn't the bitmap become a contention point itself?

> No need to do something for the possible deviation issue.

-------

regards
Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > I'd like to ask you to remove nalloced from partitions then add a
> > global atomic for the same use?
>
> I really believe it should be global. I made it per-partition to
> not overcomplicate first versions. Glad you tell it.
>
> I thought to protect it with freeList[0].mutex, but probably atomic
> is better idea here. But which atomic to chose: uint64 or uint32?
> Based on sizeof(long)?
> Ok, I'll do in next version.

Current nentries is a long (= int64 on CentOS). And uint32 can support
roughly 2^32 * 8192 = 32TB shared buffers, which doesn't seem safe
enough.  So it would be uint64.

> Whole get_hash_entry look strange.
> Doesn't it better to cycle through partitions and only then go to
> get_hash_entry?
> May be there should be bitmap for non-empty free lists? 32bit for
> 32 partitions. But wouldn't bitmap became contention point itself?

The code puts significance on avoiding contention caused by visiting
freelists of other partitions.  And it perhaps assumes that freelist
shortage rarely happens.

I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
128kB shared buffers and I saw that get_hash_entry never takes the
!element_alloc() path and always allocates a fresh entry, then
saturates at 30 new elements allocated around the middle of a 100-second
run.

Then, I tried the same with the patch, and I am surprised to see that
the rise in the number of newly allocated elements didn't stop and
went up to 511 elements after the 100-second run.  So I found that my
concern was valid.  The change in dynahash actually
continuously/repeatedly causes a lack of free list entries.  I'm not
sure how much impact it has on performance if we change
get_hash_entry to prefer other freelists, though.


By the way, there's the following comment in StrategyInitialize.

>     * Initialize the shared buffer lookup hashtable.
>     *
>     * Since we can't tolerate running out of lookup table entries, we must be
>     * sure to specify an adequate table size here.  The maximum steady-state
>     * usage is of course NBuffers entries, but BufferAlloc() tries to insert
>     * a new entry before deleting the old.  In principle this could be
>     * happening in each partition concurrently, so we could need as many as
>     * NBuffers + NUM_BUFFER_PARTITIONS entries.
>     */
>    InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

"but BufferAlloc() tries to insert a new entry before deleting the
old." gets false by this patch but still need that additional room for
stashed entries.  It seems like needing a fix.



regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Mon, 14 Mar 2022 17:12:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Then, I tried the same with the patch, and I am surprized to see that
> the rise of the number of newly allocated elements didn't stop and
> went up to 511 elements after the 100 seconds run.  So I found that my
> concern was valid.

Which means that, with high odds, my last decision was wrong...

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > > I'd like to ask you to remove nalloced from partitions then add a
> > > global atomic for the same use?
> > 
> > I really believe it should be global. I made it per-partition to
> > not overcomplicate first versions. Glad you tell it.
> > 
> > I thought to protect it with freeList[0].mutex, but probably atomic
> > is better idea here. But which atomic to chose: uint64 or uint32?
> > Based on sizeof(long)?
> > Ok, I'll do in next version.
> 
> Current nentries is a long (= int64 on CentOS). And uint32 can support
> roughly 2^32 * 8192 = 32TB shared buffers, which doesn't seem safe
> enough.  So it would be uint64.
> 
> > Whole get_hash_entry look strange.
> > Doesn't it better to cycle through partitions and only then go to
> > get_hash_entry?
> > May be there should be bitmap for non-empty free lists? 32bit for
> > 32 partitions. But wouldn't bitmap became contention point itself?
> 
> The code puts significance on avoiding contention caused by visiting
> freelists of other partitions.  And perhaps thinks that freelist
> shortage rarely happen.
> 
> I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> 128kB shared buffers and I saw that get_hash_entry never takes the
> !element_alloc() path and always allocate a fresh entry, then
> saturates at 30 new elements allocated at the medium of a 100 seconds
> run.
> 
> Then, I tried the same with the patch, and I am surprized to see that
> the rise of the number of newly allocated elements didn't stop and
> went up to 511 elements after the 100 seconds run.  So I found that my
> concern was valid.  The change in dynahash actually
> continuously/repeatedly causes lack of free list entries.  I'm not
> sure how much the impact given on performance if we change
> get_hash_entry to prefer other freelists, though.

Well, it is quite strange that SharedBufHash is not allocated as
HASH_FIXED_SIZE. Could you check what happens with this flag set?
I'll try as well.

Another way to reduce the observed case is to remember the freelist_idx for
the reused entry. I didn't believe it matters much since entries migrate
nonetheless, but probably due to some hot buffers there is a tendency to
crowd a particular freelist.
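For reference, that would roughly mean just adding the flag where the table is
created (a sketch based on the existing InitBufTable call, not tested):

    /* Sketch: forbid runtime element_alloc() for the shared buffer table. */
    SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
                                  size, size,
                                  &info,
                                  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
                                  HASH_FIXED_SIZE);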

> By the way, there's the following comment in StrategyInitalize.
> 
> >        * Initialize the shared buffer lookup hashtable.
> >        *
> >        * Since we can't tolerate running out of lookup table entries, we must be
> >        * sure to specify an adequate table size here.  The maximum steady-state
> >        * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> >        * a new entry before deleting the old.  In principle this could be
> >        * happening in each partition concurrently, so we could need as many as
> >        * NBuffers + NUM_BUFFER_PARTITIONS entries.
> >        */
> >       InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> 
> "but BufferAlloc() tries to insert a new entry before deleting the
> old." gets false by this patch but still need that additional room for
> stashed entries.  It seems like needing a fix.
> 
> 
> 
> regards.
> 
> -- 
> Kyotaro Horiguchi
> NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Mon, 14/03/2022 at 14:57 +0300, Yura Sokolov wrote:
> On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:
> > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > > > I'd like to ask you to remove nalloced from partitions then add a
> > > > global atomic for the same use?
> > > 
> > > I really believe it should be global. I made it per-partition to
> > > not overcomplicate first versions. Glad you tell it.
> > > 
> > > I thought to protect it with freeList[0].mutex, but probably atomic
> > > is better idea here. But which atomic to chose: uint64 or uint32?
> > > Based on sizeof(long)?
> > > Ok, I'll do in next version.
> > 
> > Current nentries is a long (= int64 on CentOS). And uint32 can support
> > roughly 2^32 * 8192 = 32TB shared buffers, which doesn't seem safe
> > enough.  So it would be uint64.
> > 
> > > Whole get_hash_entry look strange.
> > > Doesn't it better to cycle through partitions and only then go to
> > > get_hash_entry?
> > > May be there should be bitmap for non-empty free lists? 32bit for
> > > 32 partitions. But wouldn't bitmap became contention point itself?
> > 
> > The code puts significance on avoiding contention caused by visiting
> > freelists of other partitions.  And perhaps thinks that freelist
> > shortage rarely happen.
> > 
> > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > 128kB shared buffers and I saw that get_hash_entry never takes the
> > !element_alloc() path and always allocate a fresh entry, then
> > saturates at 30 new elements allocated at the medium of a 100 seconds
> > run.
> > 
> > Then, I tried the same with the patch, and I am surprized to see that
> > the rise of the number of newly allocated elements didn't stop and
> > went up to 511 elements after the 100 seconds run.  So I found that my
> > concern was valid.  The change in dynahash actually
> > continuously/repeatedly causes lack of free list entries.  I'm not
> > sure how much the impact given on performance if we change
> > get_hash_entry to prefer other freelists, though.
> 
> Well, it is quite strange SharedBufHash is not allocated as
> HASH_FIXED_SIZE. Could you check what happens with this flag set?
> I'll try as well.
> 
> Other way to reduce observed case is to remember freelist_idx for
> reused entry. I didn't believe it matters much since entries migrated
> netherless, but probably due to some hot buffers there are tention to
> crowd particular freelist.

Well, I did both. Everything looks ok.

> > By the way, there's the following comment in StrategyInitalize.
> > 
> > >        * Initialize the shared buffer lookup hashtable.
> > >        *
> > >        * Since we can't tolerate running out of lookup table entries, we must be
> > >        * sure to specify an adequate table size here.  The maximum steady-state
> > >        * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> > >        * a new entry before deleting the old.  In principle this could be
> > >        * happening in each partition concurrently, so we could need as many as
> > >        * NBuffers + NUM_BUFFER_PARTITIONS entries.
> > >        */
> > >       InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> > 
> > "but BufferAlloc() tries to insert a new entry before deleting the
> > old." gets false by this patch but still need that additional room for
> > stashed entries.  It seems like needing a fix.

Removed the whole paragraph, because a fixed table without extra entries works
just fine.

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

128MB and 1GB shared buffers
pgbench with scale 100
select_only benchmark, unix sockets.

Notebook i7-1165G7:


  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      29614 |      29285 |      32413 |      32784 
      2 |      58541 |      60052 |      65851 |      65938 
      3 |      91126 |      90185 |     101404 |     101956 
      5 |     135809 |     133670 |     143783 |     143471 
      7 |     155547 |     153568 |     162566 |     162361 
     17 |     221794 |     218143 |     250562 |     250136 
     27 |     213742 |     211226 |     241806 |     242594 
     53 |     216067 |     214792 |     245868 |     246269 
     83 |     216610 |     218261 |     246798 |     250515 
    107 |     216169 |     216656 |     248424 |     250105 
    139 |     208892 |     215054 |     244630 |     246439 
    163 |     206988 |     212751 |     244061 |     248051 
    191 |     203842 |     214764 |     241793 |     245081 
    211 |     201304 |     213997 |     240863 |     246076 
    239 |     199313 |     211713 |     239639 |     243586 
    271 |     196712 |     211849 |     236231 |     243831 
    307 |     194879 |     209813 |     233811 |     241303 
    353 |     191279 |     210145 |     230896 |     241039 
    397 |     188509 |     207480 |     227812 |     240637 

X5675 1 socket:

  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      18590 |      18473 |      19652 |      19051 
      2 |      34899 |      34799 |      37242 |      37432 
      3 |      51484 |      51393 |      54750 |      54398 
      5 |      71037 |      70564 |      76482 |      75985 
      7 |      87391 |      86937 |      96185 |      95433 
     17 |     122609 |     123087 |     140578 |     140325 
     27 |     120051 |     120508 |     136318 |     136343 
     53 |     116851 |     117601 |     133338 |     133265 
     83 |     113682 |     116755 |     131841 |     132736 
    107 |     111925 |     116003 |     130661 |     132386 
    139 |     109338 |     115011 |     128319 |     131453 
    163 |     107661 |     114398 |     126684 |     130677 
    191 |     105000 |     113745 |     124850 |     129909 
    211 |     103607 |     113347 |     123469 |     129302 
    239 |     101820 |     112428 |     121752 |     128621 
    271 |     100060 |     111863 |     119743 |     127624 
    307 |      98554 |     111270 |     117650 |     126877 
    353 |      97530 |     110231 |     115904 |     125351 
    397 |      96122 |     109471 |     113609 |     124150 

X5675 2 socket:

  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      17815 |      17577 |      19321 |      19187 
      2 |      34312 |      35655 |      37121 |      36479 
      3 |      51868 |      52165 |      56048 |      54984 
      5 |      81704 |      82477 |      90945 |      90109 
      7 |     107937 |     105411 |     116015 |     115810 
     17 |     191339 |     190813 |     216899 |     215775 
     27 |     236541 |     238078 |     278507 |     278073 
     53 |     230323 |     231709 |     267226 |     267449 
     83 |     225560 |     227455 |     261996 |     262344 
    107 |     221317 |     224030 |     259694 |     259553 
    139 |     206945 |     219005 |     254817 |     256736 
    163 |     197723 |     220353 |     251631 |     257305 
    191 |     193243 |     219149 |     246960 |     256528 
    211 |     189603 |     218545 |     245362 |     255785 
    239 |     186382 |     217229 |     240006 |     255024 
    271 |     183141 |     216359 |     236927 |     253069 
    307 |     179275 |     215218 |     232571 |     252375 
    353 |     175559 |     213298 |     227244 |     250534 
    397 |     172916 |     211627 |     223513 |     248919 

A strange thing: both master and the patched version have a higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

October master was 2d44dee0281a1abf and today's is 7e12256b478b895

(There is a small possibility that I tested with TCP sockets
in October and with UNIX sockets today, and that made the difference.)

[1] https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

-------

regards
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru



Attachments

Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
Thanks for the new version.

At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет:
> > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > !element_alloc() path and always allocate a fresh entry, then
> > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > run.
> > >
> > > Then, I tried the same with the patch, and I am surprized to see that
> > > the rise of the number of newly allocated elements didn't stop and
> > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > concern was valid.  The change in dynahash actually
> > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > sure how much the impact given on performance if we change
> > > get_hash_entry to prefer other freelists, though.
> >
> > Well, it is quite strange SharedBufHash is not allocated as
> > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > I'll try as well.
> >
> > Other way to reduce observed case is to remember freelist_idx for
> > reused entry. I didn't believe it matters much since entries migrated
> > netherless, but probably due to some hot buffers there are tention to
> > crowd particular freelist.
>
> Well, I did both. Everything looks ok.

Hmm. v8 returns the stashed element using the original partition index when
the element is *not* reused.  But what I saw in the previous test runs is
the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like it
behaves the same way as (or somehow even worse than) the previous
version.  get_hash_entry continuously suffers from a lack of freelist
entries. (FWIW, attached are the test-output diffs for both master and
patched.)

master finally allocated 31 fresh elements for a 100s run.

> ALLOCED: 31        ;; freshly allocated

v8 finally borrowed 33620 times from another freelist and freshly
allocated 0 elements (ah, this version changes that..)
Finally v8 results in:

> RETURNED: 50806    ;; returned stashed elements
> BORROWED: 33620    ;; borrowed from another freelist
> REUSED: 1812664    ;; stashed
> ASSIGNED: 1762377  ;; reused
>(ALLOCED: 0)        ;; freshly allocated

The numbers include a huge degradation from the frequent elog calls, so they
cannot be relied on naively, but they should show sufficiently what is happening.

> I lost access to Xeon 8354H, so returned to old Xeon X5675.
...
> Strange thing: both master and patched version has higher
> peak tps at X5676 at medium connections (17 or 27 clients)
> than in first october version [1]. But lower tps at higher
> connections number (>= 191 clients).
> I'll try to bisect on master this unfortunate change.

The reversal of the preference order between fresh allocation and
borrowing from another freelist might have an effect.


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940fa..ac651b98e6 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -31,7 +31,7 @@ typedef struct
     int            id;                /* Associated buffer ID */
 } BufferLookupEnt;
 
-static HTAB *SharedBufHash;
+HTAB *SharedBufHash;
 
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d70..294516ef01 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -195,6 +195,11 @@ struct HASHHDR
     long        ssize;            /* segment size --- must be power of 2 */
     int            sshift;            /* segment shift = log2(ssize) */
     int            nelem_alloc;    /* number of entries to allocate at once */
+    int alloc;
+    int reuse;
+    int borrow;
+    int assign;
+    int ret;
 
 #ifdef HASH_STATISTICS
 
@@ -963,6 +968,7 @@ hash_search(HTAB *hashp,
                                        foundPtr);
 }
 
+extern HTAB *SharedBufHash;
 void *
 hash_search_with_hash_value(HTAB *hashp,
                             const void *keyPtr,
@@ -1354,6 +1360,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
                     hctl->freeList[freelist_idx].nentries++;
                     SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
 
+                    if (hashp == SharedBufHash)
+                        elog(LOG, "BORROWED: %d", ++hctl->borrow);
                     return newElement;
                 }
 
@@ -1363,6 +1371,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
             /* no elements available to borrow either, so out of memory */
             return NULL;
         }
+        else if (hashp == SharedBufHash)
+            elog(LOG, "ALLOCED: %d", ++hctl->alloc);
     }
 
     /* remove entry from freelist, bump nentries */
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 55bb491ad0..029bb89f26 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -31,7 +31,7 @@ typedef struct
     int            id;                /* Associated buffer ID */
 } BufferLookupEnt;
 
-static HTAB *SharedBufHash;
+HTAB *SharedBufHash;
 
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 50c0e47643..00159714d1 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -199,6 +199,11 @@ struct HASHHDR
     int            nelem_alloc;    /* number of entries to allocate at once */
     nalloced_t    nalloced;        /* number of entries allocated */
 
+    int alloc;
+    int reuse;
+    int borrow;
+    int assign;
+    int ret;
 #ifdef HASH_STATISTICS
 
     /*
@@ -1006,6 +1011,7 @@ hash_search(HTAB *hashp,
                                        foundPtr);
 }
 
+extern HTAB *SharedBufHash;
 void *
 hash_search_with_hash_value(HTAB *hashp,
                             const void *keyPtr,
@@ -1143,6 +1149,8 @@ hash_search_with_hash_value(HTAB *hashp,
                 DynaHashReuse.hashp = hashp;
                 DynaHashReuse.freelist_idx = freelist_idx;
 
+                if (hashp == SharedBufHash)
+                    elog(LOG, "REUSED: %d", ++hctl->reuse);
                 /* Caller should call HASH_ASSIGN as the very next step. */
                 return (void *) ELEMENTKEY(currBucket);
             }
@@ -1160,6 +1168,9 @@ hash_search_with_hash_value(HTAB *hashp,
                 if (likely(DynaHashReuse.element == NULL))
                     return (void *) ELEMENTKEY(currBucket);
 
+                if (hashp == SharedBufHash)
+                    elog(LOG, "RETURNED: %d", ++hctl->ret);
+
                 freelist_idx = DynaHashReuse.freelist_idx;
                 /* if partitioned, must lock to touch nfree and freeList */
                 if (IS_PARTITIONED(hctl))
@@ -1191,6 +1202,13 @@ hash_search_with_hash_value(HTAB *hashp,
             }
             else
             {
+                if (hashp == SharedBufHash)
+                {
+                    hctl->assign++;
+                    elog(LOG, "ASSIGNED: %d (%d)",
+                         hctl->assign, hctl->reuse - hctl->assign);
+                }
+                    
                 currBucket = DynaHashReuse.element;
                 DynaHashReuse.element = NULL;
                 DynaHashReuse.hashp = NULL;
@@ -1448,6 +1466,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
                     hctl->freeList[borrow_from_idx].nfree--;
                     SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
+                    if (hashp == SharedBufHash)
+                        elog(LOG, "BORROWED: %d", ++hctl->borrow);
                     return newElement;
                 }
 
@@ -1457,6 +1477,10 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
             /* no elements available to borrow either, so out of memory */
             return NULL;
         }
+        else if (hashp == SharedBufHash)
+            elog(LOG, "ALLOCED: %d", ++hctl->alloc);
+
+            
     }
 
     /* remove entry from freelist, decrease nfree */

Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:
> Thanks for the new version.
> 
> At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет:
> > > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > > !element_alloc() path and always allocate a fresh entry, then
> > > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > > run.
> > > > 
> > > > Then, I tried the same with the patch, and I am surprized to see that
> > > > the rise of the number of newly allocated elements didn't stop and
> > > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > > concern was valid.  The change in dynahash actually
> > > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > > sure how much the impact given on performance if we change
> > > > get_hash_entry to prefer other freelists, though.
> > > 
> > > Well, it is quite strange SharedBufHash is not allocated as
> > > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > > I'll try as well.
> > > 
> > > Other way to reduce observed case is to remember freelist_idx for
> > > reused entry. I didn't believe it matters much since entries migrated
> > > netherless, but probably due to some hot buffers there are tention to
> > > crowd particular freelist.
> > 
> > Well, I did both. Everything looks ok.
> 
> Hmm. v8 returns stashed element with original patition index when the
> element is *not* reused.  But what I saw in the previous test runs is
> the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> behaving the same way (or somehow even worse) with the previous
> version.

v8 doesn't differ in the REMOVE case from either master or the
previous version. It differs in the RETURNED case only.
Or perhaps I didn't understand what you mean :(

> get_hash_entry continuously suffer lack of freelist
> entry. (FWIW, attached are the test-output diff for both master and
> patched)
> 
> master finally allocated 31 fresh elements for a 100s run.
> 
> > ALLOCED: 31        ;; freshly allocated
> 
> v8 finally borrowed 33620 times from another freelist and 0 freshly
> allocated (ah, this version changes that..)
> Finally v8 results in:
> 
> > RETURNED: 50806    ;; returned stashed elements
> > BORROWED: 33620    ;; borrowed from another freelist
> > REUSED: 1812664    ;; stashed
> > ASSIGNED: 1762377  ;; reused
> >(ALLOCED: 0)        ;; freshly allocated
> 
> It contains a huge degradation by frequent elog's so they cannot be
> naively relied on, but it should show what is happening sufficiently.

Is there any measurable performance hit caused by borrowing?
It looks like "borrowed" happened about 1.5% of the time. And that is with
128kB shared buffers, which is extremely small. (Or was it 128MB?)

Well, I think some spare entries could reduce borrowing if there is
a need. I'll test on 128MB with spare entries. If there is a benefit,
I'll add some back, but will keep SharedBufHash fixed-size.
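
For illustration, a rough sketch of what a fixed-size SharedBufHash with
spare entries could look like in InitBufTable() (buf_table.c). The
HASH_FIXED_SIZE flag and the idea of passing a few spare entries are what
is being discussed here, not committed code:

/* sketch only - as if inside src/backend/storage/buffer/buf_table.c */
void
InitBufTable(int size)
{
    HASHCTL     info;

    /* BufferTag maps to Buffer */
    info.keysize = sizeof(BufferTag);
    info.entrysize = sizeof(BufferLookupEnt);
    info.num_partitions = NUM_BUFFER_PARTITIONS;

    /* caller would pass size = NBuffers + some spare entries */
    SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
                                  size, size,
                                  &info,
                                  HASH_ELEM | HASH_BLOBS |
                                  HASH_PARTITION | HASH_FIXED_SIZE);
}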

The master branch does fewer freelist manipulations since it tries to
insert first, and if there is a collision it doesn't delete the victim
buffer.

> > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> ...
> > Strange thing: both master and patched version has higher
> > peak tps at X5676 at medium connections (17 or 27 clients)
> > than in first october version [1]. But lower tps at higher
> > connections number (>= 191 clients).
> > I'll try to bisect on master this unfortunate change.
> 
> The reversing of the preference order between freshly-allocation and
> borrow-from-another-freelist might affect.

`master` changed its behaviour as well.
It is not a problem of the patch at all.

------

regards
Yura.




Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Tue, 15/03/2022 at 13:47 +0300, Yura Sokolov wrote:
> В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > ...
> > > Strange thing: both master and patched version has higher
> > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > than in first october version [1]. But lower tps at higher
> > > connections number (>= 191 clients).
> > > I'll try to bisect on master this unfortunate change.
> > 
> > The reversing of the preference order between freshly-allocation and
> > borrow-from-another-freelist might affect.
> 
> `master` changed its behaviour as well.
> It is not problem of the patch at all.

Looks like there is no issue: the old commit 2d44dee0281a1abf
now behaves similarly to the new one.

I think something changed in the environment.
I remember there was maintenance downtime in the autumn.
Perhaps the kernel was updated or some sysctl tuning changed.

----

regards
Yura.




Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Tue, 15/03/2022 at 13:47 +0300, Yura Sokolov wrote:
> В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > Thanks for the new version.
> > 
> > At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет:
> > > > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > > > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > > > !element_alloc() path and always allocate a fresh entry, then
> > > > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > > > run.
> > > > > 
> > > > > Then, I tried the same with the patch, and I am surprized to see that
> > > > > the rise of the number of newly allocated elements didn't stop and
> > > > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > > > concern was valid.  The change in dynahash actually
> > > > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > > > sure how much the impact given on performance if we change
> > > > > get_hash_entry to prefer other freelists, though.
> > > > 
> > > > Well, it is quite strange SharedBufHash is not allocated as
> > > > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > > > I'll try as well.
> > > > 
> > > > Other way to reduce observed case is to remember freelist_idx for
> > > > reused entry. I didn't believe it matters much since entries migrated
> > > > netherless, but probably due to some hot buffers there are tention to
> > > > crowd particular freelist.
> > > 
> > > Well, I did both. Everything looks ok.
> > 
> > Hmm. v8 returns stashed element with original patition index when the
> > element is *not* reused.  But what I saw in the previous test runs is
> > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > behaving the same way (or somehow even worse) with the previous
> > version.
> 
> v8 doesn't differ in REMOVE case neither from master nor from
> previous version. It differs in RETURNED case only.
> Or I didn't understand what you mean :(
> 
> > get_hash_entry continuously suffer lack of freelist
> > entry. (FWIW, attached are the test-output diff for both master and
> > patched)
> > 
> > master finally allocated 31 fresh elements for a 100s run.
> > 
> > > ALLOCED: 31        ;; freshly allocated
> > 
> > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > allocated (ah, this version changes that..)
> > Finally v8 results in:
> > 
> > > RETURNED: 50806    ;; returned stashed elements
> > > BORROWED: 33620    ;; borrowed from another freelist
> > > REUSED: 1812664    ;; stashed
> > > ASSIGNED: 1762377  ;; reused
> > > (ALLOCED: 0)        ;; freshly allocated
> > 
> > It contains a huge degradation by frequent elog's so they cannot be
> > naively relied on, but it should show what is happening sufficiently.
> 
> Is there any measurable performance hit cause of borrowing?
> Looks like "borrowed" happened in 1.5% of time. And it is on 128kb
> shared buffers that is extremely small. (Or it was 128MB?)
> 
> Well, I think some spare entries could reduce borrowing if there is
> a need. I'll test on 128MB with spare entries. If there is profit,
> I'll return some, but will keep SharedBufHash fixed.

Well, I added GetMaxBackends spare items, but I don't see a clear
benefit. It is probably a bit better at 128MB shared buffers and
probably a bit worse at 1GB shared buffers (select-only on scale 100).

But this is on the old Xeon X5675. Things may well change on more
capable hardware; I just don't have access to any at the moment.

> 
> Master branch does less freelist manipulations since it  tries to
> insert first and if there is collision it doesn't delete victim
> buffer.
> 

-----

regards
Yura




Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > Hmm. v8 returns stashed element with original patition index when the
> > element is *not* reused.  But what I saw in the previous test runs is
> > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > behaving the same way (or somehow even worse) with the previous
> > version.
>
> v8 doesn't differ in REMOVE case neither from master nor from
> previous version. It differs in RETURNED case only.
> Or I didn't understand what you mean :(

In v7, HASH_ENTER returns the element stored in DynaHashReuse using
the freelist_idx of the new key.  v8 uses that of the old key (at the
time of HASH_REUSE).  So in the "REUSE->ENTER(elem exists and
returns the stashed)" case the stashed element is returned to its
original partition.  But that is not what I mentioned.

On the other hand, once the stashed element is reused by HASH_ENTER,
it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
from old partition) case.  I suspect that the frequent freelist
starvation comes from the latter case.

> > get_hash_entry continuously suffer lack of freelist
> > entry. (FWIW, attached are the test-output diff for both master and
> > patched)
> >
> > master finally allocated 31 fresh elements for a 100s run.
> >
> > > ALLOCED: 31        ;; freshly allocated
> >
> > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > allocated (ah, this version changes that..)
> > Finally v8 results in:
> >
> > > RETURNED: 50806    ;; returned stashed elements
> > > BORROWED: 33620    ;; borrowed from another freelist
> > > REUSED: 1812664    ;; stashed
> > > ASSIGNED: 1762377  ;; reused
> > >(ALLOCED: 0)        ;; freshly allocated

(I misunderstood in thinking that v8 modified get_hash_entry's preference
between allocation and borrowing.)

I re-ran the same check for v7 and it showed a different result.

RETURNED: 1
ALLOCED: 15
BORROWED: 0
REUSED: 505435
ASSIGNED: 505462 (-27)  ## the counters are not locked.

> Is there any measurable performance hit cause of borrowing?
> Looks like "borrowed" happened in 1.5% of time. And it is on 128kb
> shared buffers that is extremely small. (Or it was 128MB?)

It is intentionally set small to get extremely frequent buffer
replacements.  The point here was that the patch actually can induce
frequent freelist starvation.  And like you, I also doubt the
significance of the performance hit from that.  I just was not sure.

I re-ran the same for v8 and got a result largely different from the
previous trial on the same v8.

RETURNED: 2
ALLOCED: 0
BORROWED: 435
REUSED: 495444
ASSIGNED: 495467 (-23)

Now "BORROWED" happens 0.8% of REUSED.

> Well, I think some spare entries could reduce borrowing if there is
> a need. I'll test on 128MB with spare entries. If there is profit,
> I'll return some, but will keep SharedBufHash fixed.

I don't doubt the benefit of this patch.  And I have now convinced myself
that the downside is negligible compared to the benefit.

> Master branch does less freelist manipulations since it  tries to
> insert first and if there is collision it doesn't delete victim
> buffer.
>
> > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > ...
> > > Strange thing: both master and patched version has higher
> > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > than in first october version [1]. But lower tps at higher
> > > connections number (>= 191 clients).
> > > I'll try to bisect on master this unfortunate change.
> >
> > The reversing of the preference order between freshly-allocation and
> > borrow-from-another-freelist might affect.
>
> `master` changed its behaviour as well.
> It is not problem of the patch at all.

Agreed.  So I think we should go in this direction.

There are some last comments on v8.

+                                  HASH_FIXED_SIZE);

Ah, now I understand that this prevents allocation of new elements.
I think this is good to do for SharedBufHash.


====
+    long        nfree;            /* number of free entries in the list */
     HASHELEMENT *freeList;        /* chain of free elements */
 } FreeListData;

+#if SIZEOF_LONG == 4
+typedef pg_atomic_uint32 nalloced_store_t;
+typedef uint32 nalloced_value_t;
+#define nalloced_read(a)    (long)pg_atomic_read_u32(a)
+#define nalloced_add(a, v)    pg_atomic_fetch_add_u32((a), (uint32)(v))
====

I don't think nalloced needs to be the same width as long.  For
platforms with a 32-bit long, any possible degradation caused by a
64-bit atomic there doesn't matter anyway.  So can't we always define the
atomic as 64-bit and use the pg_atomic_* functions directly?
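
For illustration only, the suggestion amounts to something like the
following sketch (the struct and helper names here are made up, not taken
from the patch):

#include "postgres.h"
#include "port/atomics.h"

/* sketch: the counter is always pg_atomic_uint64, regardless of SIZEOF_LONG */
typedef struct
{
    pg_atomic_uint64 nalloced;   /* number of entries allocated so far */
} nalloced_counter_sketch;

static inline long
nalloced_read(nalloced_counter_sketch *c)
{
    return (long) pg_atomic_read_u64(&c->nalloced);
}

static inline void
nalloced_add(nalloced_counter_sketch *c, int n)
{
    pg_atomic_fetch_add_u64(&c->nalloced, (uint64) n);
}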


+        case HASH_REUSE:
+            if (currBucket != NULL)

Don't we need an assertion on (DynaHashReuse.element == NULL) here?


-    size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+    /* size of lookup hash table */
+    size = add_size(size, BufTableShmemSize(NBuffers));

I was not sure that this is safe, but in practice I didn't get "out of
shared memory". On second thought, I realized that when a dynahash
entry is stashed, BufferAlloc is always holding a buffer block, too.
So now I'm sure that this is safe.


That's all.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:
> At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > > Hmm. v8 returns stashed element with original patition index when the
> > > element is *not* reused.  But what I saw in the previous test runs is
> > > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > > behaving the same way (or somehow even worse) with the previous
> > > version.
> > 
> > v8 doesn't differ in REMOVE case neither from master nor from
> > previous version. It differs in RETURNED case only.
> > Or I didn't understand what you mean :(
> 
> In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> the freelist_idx of the new key.  v8 uses that of the old key (at the
> time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> returns the stashed)" case the stashed element is returned to its
> original partition.  But it is not what I mentioned.
> 
> On the other hand, once the stahsed element is reused by HASH_ENTER,
> it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> from old partition) case.  I suspect that ththat the frequent freelist
> starvation comes from the latter case.

I doubt it. By probability theory, it is unlikely that a single partition
will overflow too much. Hence the freelists.

But! With 128kB shared buffers there are just 32 buffers. 32 entries for
32 freelist partitions - certainly some freelist partition will
have 0 entries even if all entries are in the freelists.

> > > get_hash_entry continuously suffer lack of freelist
> > > entry. (FWIW, attached are the test-output diff for both master and
> > > patched)
> > > 
> > > master finally allocated 31 fresh elements for a 100s run.
> > > 
> > > > ALLOCED: 31        ;; freshly allocated
> > > 
> > > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > > allocated (ah, this version changes that..)
> > > Finally v8 results in:
> > > 
> > > > RETURNED: 50806    ;; returned stashed elements
> > > > BORROWED: 33620    ;; borrowed from another freelist
> > > > REUSED: 1812664    ;; stashed
> > > > ASSIGNED: 1762377  ;; reused
> > > >(ALLOCED: 0)        ;; freshly allocated
> 
> (I misunderstand that v8 modified get_hash_entry's preference between
> allocation and borrowing.)
> 
> I re-ran the same check for v7 and it showed different result.
> 
> RETURNED: 1
> ALLOCED: 15
> BORROWED: 0
> REUSED: 505435
> ASSIGNED: 505462 (-27)  ## the counters are not locked.
> 
> > Is there any measurable performance hit cause of borrowing?
> > Looks like "borrowed" happened in 1.5% of time. And it is on 128kb
> > shared buffers that is extremely small. (Or it was 128MB?)
> 
> It is intentional set small to get extremely frequent buffer
> replacements.  The point here was the patch actually can induce
> frequent freelist starvation.  And as you do, I also doubt the
> significance of the performance hit by that.  Just I was not usre.
>
> I re-ran the same for v8 and got a result largely different from the
> previous trial on the same v8.
> 
> RETURNED: 2
> ALLOCED: 0
> BORROWED: 435
> REUSED: 495444
> ASSIGNED: 495467 (-23)
> 
> Now "BORROWED" happens 0.8% of REUSED

0.08% actually :)

> 
> > Well, I think some spare entries could reduce borrowing if there is
> > a need. I'll test on 128MB with spare entries. If there is profit,
> > I'll return some, but will keep SharedBufHash fixed.
> 
> I don't doubt the benefit of this patch.  And now convinced by myself
> that the downside is negligible than the benefit.
> 
> > Master branch does less freelist manipulations since it  tries to
> > insert first and if there is collision it doesn't delete victim
> > buffer.
> > 
> > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > ...
> > > > Strange thing: both master and patched version has higher
> > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > than in first october version [1]. But lower tps at higher
> > > > connections number (>= 191 clients).
> > > > I'll try to bisect on master this unfortunate change.
> > > 
> > > The reversing of the preference order between freshly-allocation and
> > > borrow-from-another-freelist might affect.
> > 
> > `master` changed its behaviour as well.
> > It is not problem of the patch at all.
> 
> Agreed.  So I think we should go on this direction.

I've checked. It looks like something had changed on the server, since
the old master commit now behaves the same as the new one (and differently
from how it behaved in October).
I remember maintenance downtime of the server in November/December.
Probably the kernel was upgraded or some system settings were changed.

> There are some last comments on v8.
> 
> +                                                                 HASH_FIXED_SIZE);
> 
> Ah, now I understand that this prevented allocation of new elements.
> I think this good to do for SharedBufHash.
> 
> 
> ====
> +       long            nfree;                  /* number of free entries in the list */
>         HASHELEMENT *freeList;          /* chain of free elements */
>  } FreeListData;
>  
> +#if SIZEOF_LONG == 4
> +typedef pg_atomic_uint32 nalloced_store_t;
> +typedef uint32 nalloced_value_t;
> +#define nalloced_read(a)       (long)pg_atomic_read_u32(a)
> +#define nalloced_add(a, v)     pg_atomic_fetch_add_u32((a), (uint32)(v))
> ====
> 
> I don't think nalloced needs to be the same width to long.  For the
> platforms with 32-bit long, anyway the possible degradation if any by
> 64-bit atomic there doesn't matter.  So don't we always define the
> atomic as 64bit and use the pg_atomic_* functions directly?

Some 32-bit platforms have no native 64-bit atomics. There they are
emulated with locks.

Native atomic read/write is quite cheap, so I don't bother with
unlocked read/write for non-partitioned tables. (And I don't know of
a platform that has sizeof(long) > 4 without also having native 64-bit
atomics.)

(Maybe I'm wrong a bit? element_alloc invokes nalloced_add, which
is an atomic increment. Could it be expensive enough to be a problem
in non-shared dynahash instances?)

If the patch sticks with pg_atomic_uint64 for nalloced, then it has
to separate the read+write paths for the partitioned (actually shared) and
non-partitioned cases.

Well, and for a 32-bit platform long is just enough. Why spend another
4 bytes per dynahash?

By the way, there is an unfortunate omission of PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
in port/atomics/arch-arm.h for aarch64. I'll send a patch for it in
a new thread.

> +               case HASH_REUSE:
> +                       if (currBucket != NULL)
> 
> Don't we need an assertion on (DunaHashReuse.element == NULL) here?

The common assert is higher up, at line 1094:

    Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);

I thought that is more accurate than duplicating it in each switch case.

> -       size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
> +       /* size of lookup hash table */
> +       size = add_size(size, BufTableShmemSize(NBuffers));
> 
> I was not sure that this is safe, but actually I didn't get "out of
> shared memory". On second thought, I realized that when a dynahash
> entry is stashed, BufferAlloc always holding a buffer block, too.
> So now I'm sure that this is safe.
> 
> 
> That's all.

Thank you very much for the productive review and discussion.



regards,

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com




Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> В Ср, 16/03/2022 в 12:07 +0900, Kyotaro Horiguchi пишет:
> > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> > returns the stashed)" case the stashed element is returned to its
> > original partition.  But it is not what I mentioned.
> >
> > On the other hand, once the stahsed element is reused by HASH_ENTER,
> > it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> > from old partition) case.  I suspect that ththat the frequent freelist
> > starvation comes from the latter case.
>
> Doubtfully. Due to probabilty theory, single partition doubdfully
> will be too overflowed. Therefore, freelist.

Yeah.  I think so generally.

> But! With 128kb shared buffers there is just 32 buffers. 32 entry for
> 32 freelist partition - certainly some freelist partition will certainly
> have 0 entry even if all entries are in freelists.

Anyway, it's an extreme condition and the starvation happens only at a
negligible rate.

> > RETURNED: 2
> > ALLOCED: 0
> > BORROWED: 435
> > REUSED: 495444
> > ASSIGNED: 495467 (-23)
> >
> > Now "BORROWED" happens 0.8% of REUSED
>
> 0.08% actually :)

Mmm.  Doesn't matter:p

> > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > ...
> > > > > Strange thing: both master and patched version has higher
> > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > than in first october version [1]. But lower tps at higher
> > > > > connections number (>= 191 clients).
> > > > > I'll try to bisect on master this unfortunate change.
...
> I've checked. Looks like something had changed on the server, since
> old master commit behaves now same to new one (and differently to
> how it behaved in October).
> I remember maintainance downtime of the server in november/december.
> Probably, kernel were upgraded or some system settings were changed.

One thing I have a little concern about is that the numbers show a steady
1-2% degradation for connection counts < 17.

I think there are two possible causes of the degradation.

1. The additional branch from consolidating HASH_ASSIGN into HASH_ENTER.
  This might cause degradation for memory-contended use.

2. The nalloced operation might cause degradation on non-shared dynahashes?
  I believe it doesn't, but I'm not sure.

  In a simple benchmark with pgbench on a laptop, dynahash
  allocation (including shared and non-shared) happened about 50
  times per second with 10 processes and 200 times with 100 processes.

> > I don't think nalloced needs to be the same width to long.  For the
> > platforms with 32-bit long, anyway the possible degradation if any by
> > 64-bit atomic there doesn't matter.  So don't we always define the
> > atomic as 64bit and use the pg_atomic_* functions directly?
>
> Some 32bit platforms has no native 64bit atomics. Then they are
> emulated with locks.
>
> Well, and for 32bit platform long is just enough. Why spend other
> 4 bytes per each dynahash?

I don't think the additional bytes matter, but emulated atomic
operations can matter. However, I'm not sure which platforms use those
fallback implementations.  (x86 seems to have had __sync_fetch_and_add()
since the P4.)

My opinion in the previous mail was that if that level of degradation
caused by emulated atomic operations matters, we shouldn't use an atomic
there at all, since atomic operations on modern platforms are not
free either.

In relation to 2 above, if we observe that the degradation disappears
by (tentatively) using non-atomic operations for nalloced, we should go
back to the previous per-freelist nalloced.

I don't have access to such muscular machines, though..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
On Thu, 17/03/2022 at 12:02 +0900, Kyotaro Horiguchi wrote:
> At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > В Ср, 16/03/2022 в 12:07 +0900, Kyotaro Horiguchi пишет:
> > > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > > time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> > > returns the stashed)" case the stashed element is returned to its
> > > original partition.  But it is not what I mentioned.
> > > 
> > > On the other hand, once the stahsed element is reused by HASH_ENTER,
> > > it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> > > from old partition) case.  I suspect that ththat the frequent freelist
> > > starvation comes from the latter case.
> > 
> > Doubtfully. Due to probabilty theory, single partition doubdfully
> > will be too overflowed. Therefore, freelist.
> 
> Yeah.  I think so generally.
> 
> > But! With 128kb shared buffers there is just 32 buffers. 32 entry for
> > 32 freelist partition - certainly some freelist partition will certainly
> > have 0 entry even if all entries are in freelists. 
> 
> Anyway, it's an extreme condition and the starvation happens only at a
> neglegible ratio.
> 
> > > RETURNED: 2
> > > ALLOCED: 0
> > > BORROWED: 435
> > > REUSED: 495444
> > > ASSIGNED: 495467 (-23)
> > > 
> > > Now "BORROWED" happens 0.8% of REUSED
> > 
> > 0.08% actually :)
> 
> Mmm.  Doesn't matter:p
> 
> > > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > > ...
> > > > > > Strange thing: both master and patched version has higher
> > > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > > than in first october version [1]. But lower tps at higher
> > > > > > connections number (>= 191 clients).
> > > > > > I'll try to bisect on master this unfortunate change.
> ...
> > I've checked. Looks like something had changed on the server, since
> > old master commit behaves now same to new one (and differently to
> > how it behaved in October).
> > I remember maintainance downtime of the server in november/december.
> > Probably, kernel were upgraded or some system settings were changed.
> 
> One thing I have a little concern is that numbers shows 1-2% of
> degradation steadily for connection numbers < 17.
> 
> I think there are two possible cause of the degradation.
> 
> 1. Additional branch by consolidating HASH_ASSIGN into HASH_ENTER.
>   This might cause degradation for memory-contended use.
> 
> 2. nallocs operation might cause degradation on non-shared dynahasyes?
>   I believe doesn't but I'm not sure.
> 
>   On a simple benchmarking with pgbench on a laptop, dynahash
>   allocation (including shared and non-shared) happend about at 50
>   times per second with 10 processes and 200 with 100 processes.
> 
> > > I don't think nalloced needs to be the same width to long.  For the
> > > platforms with 32-bit long, anyway the possible degradation if any by
> > > 64-bit atomic there doesn't matter.  So don't we always define the
> > > atomic as 64bit and use the pg_atomic_* functions directly?
> > 
> > Some 32bit platforms has no native 64bit atomics. Then they are
> > emulated with locks.
> > 
> > Well, and for 32bit platform long is just enough. Why spend other
> > 4 bytes per each dynahash?
> 
> I don't think additional bytes doesn't matter, but emulated atomic
> operations can matter. However I'm not sure which platform uses that
> fallback implementations.  (x86 seems to have __sync_fetch_and_add()
> since P4).
> 
> My opinion in the previous mail is that if that level of degradation
> caued by emulated atomic operations matters, we shouldn't use atomic
> there at all since atomic operations on the modern platforms are not
> also free.
> 
> In relation to 2 above, if we observe that the degradation disappears
> by (tentatively) use non-atomic operations for nalloced, we should go
> back to the previous per-freelist nalloced.

Here is a version with nalloced being a union of the appropriate atomic
type and long.
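
(Roughly, the shape is as in the sketch below; the names are illustrative
and the attached patch is authoritative. Shared/partitioned hashes go
through the atomic member, private hashes through the plain long.)

#include "postgres.h"
#include "port/atomics.h"

/* sketch only: a union so a private (non-partitioned) hash pays no atomic cost */
typedef union
{
#if SIZEOF_LONG == 4
    pg_atomic_uint32 atomic;
#else
    pg_atomic_uint64 atomic;
#endif
    long        plain;
} nalloced_sketch_t;

static inline void
nalloced_add_sketch(nalloced_sketch_t *n, long v, bool partitioned)
{
    if (partitioned)
#if SIZEOF_LONG == 4
        pg_atomic_fetch_add_u32(&n->atomic, (uint32) v);
#else
        pg_atomic_fetch_add_u64(&n->atomic, (uint64) v);
#endif
    else
        n->plain += v;          /* unlocked: private hashes are single-process */
}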

------

regards
Yura Sokolov

Attachments

Re: BufferAlloc: don't take two simultaneous locks

From
Yura Sokolov
Date:
Good day, Kyotaro-san.
Good day, hackers.

On Sun, 20/03/2022 at 12:38 +0300, Yura Sokolov wrote:
> В Чт, 17/03/2022 в 12:02 +0900, Kyotaro Horiguchi пишет:
> > At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > В Ср, 16/03/2022 в 12:07 +0900, Kyotaro Horiguchi пишет:
> > > > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > > > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > > > time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> > > > returns the stashed)" case the stashed element is returned to its
> > > > original partition.  But it is not what I mentioned.
> > > > 
> > > > On the other hand, once the stahsed element is reused by HASH_ENTER,
> > > > it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> > > > from old partition) case.  I suspect that ththat the frequent freelist
> > > > starvation comes from the latter case.
> > > 
> > > Doubtfully. Due to probabilty theory, single partition doubdfully
> > > will be too overflowed. Therefore, freelist.
> > 
> > Yeah.  I think so generally.
> > 
> > > But! With 128kb shared buffers there is just 32 buffers. 32 entry for
> > > 32 freelist partition - certainly some freelist partition will certainly
> > > have 0 entry even if all entries are in freelists. 
> > 
> > Anyway, it's an extreme condition and the starvation happens only at a
> > neglegible ratio.
> > 
> > > > RETURNED: 2
> > > > ALLOCED: 0
> > > > BORROWED: 435
> > > > REUSED: 495444
> > > > ASSIGNED: 495467 (-23)
> > > > 
> > > > Now "BORROWED" happens 0.8% of REUSED
> > > 
> > > 0.08% actually :)
> > 
> > Mmm.  Doesn't matter:p
> > 
> > > > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > > > ...
> > > > > > > Strange thing: both master and patched version has higher
> > > > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > > > than in first october version [1]. But lower tps at higher
> > > > > > > connections number (>= 191 clients).
> > > > > > > I'll try to bisect on master this unfortunate change.
> > ...
> > > I've checked. Looks like something had changed on the server, since
> > > old master commit behaves now same to new one (and differently to
> > > how it behaved in October).
> > > I remember maintainance downtime of the server in november/december.
> > > Probably, kernel were upgraded or some system settings were changed.
> > 
> > One thing I have a little concern is that numbers shows 1-2% of
> > degradation steadily for connection numbers < 17.
> > 
> > I think there are two possible cause of the degradation.
> > 
> > 1. Additional branch by consolidating HASH_ASSIGN into HASH_ENTER.
> >   This might cause degradation for memory-contended use.
> > 
> > 2. nallocs operation might cause degradation on non-shared dynahasyes?
> >   I believe doesn't but I'm not sure.
> > 
> >   On a simple benchmarking with pgbench on a laptop, dynahash
> >   allocation (including shared and non-shared) happend about at 50
> >   times per second with 10 processes and 200 with 100 processes.
> > 
> > > > I don't think nalloced needs to be the same width to long.  For the
> > > > platforms with 32-bit long, anyway the possible degradation if any by
> > > > 64-bit atomic there doesn't matter.  So don't we always define the
> > > > atomic as 64bit and use the pg_atomic_* functions directly?
> > > 
> > > Some 32bit platforms has no native 64bit atomics. Then they are
> > > emulated with locks.
> > > 
> > > Well, and for 32bit platform long is just enough. Why spend other
> > > 4 bytes per each dynahash?
> > 
> > I don't think additional bytes doesn't matter, but emulated atomic
> > operations can matter. However I'm not sure which platform uses that
> > fallback implementations.  (x86 seems to have __sync_fetch_and_add()
> > since P4).
> > 
> > My opinion in the previous mail is that if that level of degradation
> > caued by emulated atomic operations matters, we shouldn't use atomic
> > there at all since atomic operations on the modern platforms are not
> > also free.
> > 
> > In relation to 2 above, if we observe that the degradation disappears
> > by (tentatively) use non-atomic operations for nalloced, we should go
> > back to the previous per-freelist nalloced.
> 
> Here is version with nalloced being union of appropriate atomic and
> long.
> 

OK, I got access to a stronger server, did the benchmarks, found weird
things, and so here is a new version :-)

First, I found that if the table size is strictly limited to NBuffers and
marked FIXED, then under high concurrency get_hash_entry may not find a
free entry even though one must be there. It seems that while a process
scans the free lists, other concurrent processes "move entries around",
i.e. one concurrent process fetches an entry from one free list while
another process puts a new entry into a different freelist, and the
unfortunate process misses it since it tests each freelist only once.

Second, I confirm there is a problem with freelist spreading.
If I keep the entry's freelist_idx, then one freelist gets crowded.
If I use the new entry's freelist_idx, then one freelist gets emptied
constantly.

Third, I found that increased concurrency could do harm. When a popular
block is evicted for some reason, a thundering herd effect occurs: many
backends want to read the same block, and they evict many other buffers,
but only one of them is inserted; the others go to the freelist. The
evicted buffers by themselves reduce the cache hit ratio and provoke more
work. The old version resists this effect by not removing the old buffer
before the new entry is successfully inserted.

To fix these issues I made the following changes:

# Concurrency

First, I limit concurrency by introducing another lwlock tranche -
BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
128 locks).
If a backend doesn't find the buffer in the buffer table and wants to
introduce it, it first calls
    LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
If the lock was acquired, it proceeds to the eviction and replacement
process. Otherwise, it waits for the lock to be released and repeats the
search.

This greatly improves performance for > 400 clients in pgbench.
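
The control flow is roughly as in the sketch below (simplified, with
names such as newEvictPartitionLock taken from the description above;
the attached patch is the authoritative version):

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"

/* sketch of the lookup/evict gate in BufferAlloc-like code */
static BufferDesc *
lookup_or_start_eviction(BufferTag newTag, uint32 newHash,
                         LWLock *newPartitionLock,
                         LWLock *newEvictPartitionLock)
{
    for (;;)
    {
        int         buf_id;

        LWLockAcquire(newPartitionLock, LW_SHARED);
        buf_id = BufTableLookup(&newTag, newHash);
        LWLockRelease(newPartitionLock);

        if (buf_id >= 0)
            return GetBufferDescriptor(buf_id); /* somebody else read it in */

        /*
         * Not in the table.  Either we acquire the eviction lock and do the
         * replacement ourselves, or we slept until the holder released it;
         * in the latter case re-check the buffer table, the block is likely
         * there now.
         */
        if (LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE))
            break;
    }

    /*
     * ... caller picks a victim, flushes it, retags it, inserts the new
     * tag, and finally does LWLockRelease(newEvictPartitionLock) ...
     */
    return NULL;
}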

I tried another variant as well:
- first insert an entry with a dummy buffer index into the buffer table;
- if such an entry is already there, wait for it to be filled;
- otherwise find a victim buffer and replace the dummy index with the new one.
The wait was done with a shared lock on the EvictPartitionLock as well.
This variant performed about the same.

Logically I like that variant more, but there is one gotcha:
FlushBuffer could fail with elog(ERROR), so there has to be a
reliable way to remove the entry with the dummy index.
And after all, I still need to hold the EvictPartitionLock to notice
waiters.

I tried to use a ConditionVariable, but its performance was much
worse.

# Dynahash capacity and freelists.

I returned the buffer table initialization to how it was:
- removed the FIXES_SIZE restriction introduced in the previous version
- returned to `NBuffers + NUM_BUFFER_PARTITIONS`.
I really think there should be more spare items, since almost always
entry_alloc is called at least once (on 128MB shared_buffers). But
let's keep it as is for now.

`get_hash_entry` was changed to probe NUM_FREELISTS/4 (== 8) freelists
before falling back to `entry_alloc`, and the probing was changed from
linear to quadratic. This greatly reduces the number of calls to
`entry_alloc`, so more shared memory is left intact. And I didn't notice
a large performance hit from it. There probably is some, but I think it is
an adequate trade-off.
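
Roughly, the probing looks like the sketch below (illustrative only; it
uses the patched FreeListData fields such as nfree, and the exact step
sequence may differ from the attached patch):

/* sketch, as if inside src/backend/utils/hash/dynahash.c */
static HASHELEMENT *
try_get_free_element(HASHHDR *hctl, int freelist_idx)
{
    for (int attempt = 0; attempt < NUM_FREELISTS / 4; attempt++)
    {
        /* visit freelist_idx, then +1, +3, +6, +10, ... (mod NUM_FREELISTS) */
        int         idx = (freelist_idx + attempt * (attempt + 1) / 2) % NUM_FREELISTS;
        HASHELEMENT *elem;

        SpinLockAcquire(&hctl->freeList[idx].mutex);
        elem = hctl->freeList[idx].freeList;
        if (elem != NULL)
        {
            hctl->freeList[idx].freeList = elem->link;
            hctl->freeList[idx].nfree--;
        }
        SpinLockRelease(&hctl->freeList[idx].mutex);

        if (elem != NULL)
            return elem;
    }
    return NULL;                /* caller falls back to entry_alloc() */
}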

`free_reused_entry` now returns the entry to a random freelist. That
flattens the spread of free entries, although it is not enough without the
other changes (thundering herd mitigation and probing more lists in
get_hash_entry).
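
(And free_reused_entry is essentially the following sketch, with random()
standing in for whatever cheap PRNG the patch actually uses - again just
an illustration, not the patch itself:)

/* sketch: put a reused element back onto a randomly chosen freelist */
static void
free_reused_entry_sketch(HASHHDR *hctl, HASHELEMENT *elem)
{
    int         idx = random() % NUM_FREELISTS;

    SpinLockAcquire(&hctl->freeList[idx].mutex);
    elem->link = hctl->freeList[idx].freeList;
    hctl->freeList[idx].freeList = elem;
    hctl->freeList[idx].nfree++;
    SpinLockRelease(&hctl->freeList[idx].mutex);
}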

# Benchmarks

Benchmarked on a two-socket Xeon(R) Gold 5220 CPU @ 2.20GHz,
18 cores per socket + hyper-threading - up to 72 virtual cores in total.
Turbo-boost disabled.
Linux 5.10.103-1 Debian.

pgbench scale 100 simple_select + simple select with 3 keys (sql file
attached).

shared buffers 128MB & 1GB
huge_pages=on

1 socket
  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      27882 |      27738 |      32735 |      32439 
      2 |      54082 |      54336 |      64387 |      63846 
      3 |      80724 |      81079 |      96387 |      94439 
      5 |     134404 |     133429 |     160085 |     157399 
      7 |     185977 |     184502 |     219916 |     217142 
     17 |     335345 |     338214 |     393112 |     388796 
     27 |     393686 |     394948 |     447945 |     444915 
     53 |     572234 |     577092 |     678884 |     676493 
     83 |     558875 |     561689 |     669212 |     655697 
    107 |     553054 |     551896 |     654550 |     646010 
    139 |     541263 |     538354 |     641937 |     633840 
    163 |     532932 |     531829 |     635127 |     627600 
    191 |     524647 |     524442 |     626228 |     617347 
    211 |     521624 |     522197 |     629740 |     613143 
    239 |     509448 |     554894 |     652353 |     652972 
    271 |     468190 |     557467 |     647403 |     661348 
    307 |     454139 |     558694 |     642229 |     657649 
    353 |     446853 |     554301 |     635991 |     654571 
    397 |     441909 |     549822 |     625194 |     647973 

1 socket 3 keys

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      16677 |      16477 |      22219 |      22030 
      2 |      32056 |      31874 |      43298 |      43153 
      3 |      48091 |      47766 |      64877 |      64600 
      5 |      78999 |      78609 |     105433 |     106101 
      7 |     108122 |     107529 |     148713 |     145343 
     17 |     205656 |     209010 |     272676 |     271449 
     27 |     252015 |     254000 |     323983 |     323499 
     53 |     317928 |     334493 |     446740 |     449641 
     83 |     299234 |     327738 |     437035 |     443113 
    107 |     290089 |     322025 |     430535 |     431530 
    139 |     277294 |     314384 |     422076 |     423606 
    163 |     269029 |     310114 |     416229 |     417412 
    191 |     257315 |     306530 |     408487 |     416170 
    211 |     249743 |     304278 |     404766 |     416393 
    239 |     243333 |     310974 |     397139 |     428167 
    271 |     236356 |     309215 |     389972 |     427498 
    307 |     229094 |     307519 |     382444 |     425891 
    353 |     224385 |     305366 |     375020 |     423284 
    397 |     218549 |     302577 |     364373 |     420846 

2 sockets

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      27287 |      27631 |      32943 |      32493 
      2 |      52397 |      54011 |      64572 |      63596 
      3 |      76157 |      80473 |      93363 |      93528 
      5 |     127075 |     134310 |     153176 |     149984 
      7 |     177100 |     176939 |     216356 |     211599 
     17 |     379047 |     383179 |     464249 |     470351 
     27 |     545219 |     546706 |     664779 |     662488 
     53 |     728142 |     728123 |     857454 |     869407 
     83 |     918276 |     957722 |    1215252 |    1203443 
    107 |     884112 |     971797 |    1206930 |    1234606 
    139 |     822564 |     970920 |    1167518 |    1233230 
    163 |     788287 |     968248 |    1130021 |    1229250 
    191 |     772406 |     959344 |    1097842 |    1218541 
    211 |     756085 |     955563 |    1077747 |    1209489 
    239 |     732926 |     948855 |    1050096 |    1200878 
    271 |     692999 |     941722 |    1017489 |    1194012 
    307 |     668241 |     920478 |     994420 |    1179507 
    353 |     642478 |     908645 |     968648 |    1174265 
    397 |     617673 |     893568 |     950736 |    1173411 

2 sockets 3 keys

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      16722 |      16393 |      20340 |      21813 
      2 |      32057 |      32009 |      39993 |      42959 
      3 |      46202 |      47678 |      59216 |      64374 
      5 |      78882 |      72002 |      98054 |     103731 
      7 |     103398 |      99538 |     135098 |     135828 
     17 |     205863 |     217781 |     293958 |     299690 
     27 |     283526 |     290539 |     414968 |     411219 
     53 |     336717 |     356130 |     460596 |     474563 
     83 |     307310 |     342125 |     419941 |     469989 
    107 |     294059 |     333494 |     405706 |     469593 
    139 |     278453 |     328031 |     390984 |     470553 
    163 |     270833 |     326457 |     384747 |     470977 
    191 |     259591 |     322590 |     376582 |     470335 
    211 |     263584 |     321263 |     375969 |     469443 
    239 |     257135 |     316959 |     370108 |     470904 
    271 |     251107 |     315393 |     365794 |     469517 
    307 |     246605 |     311585 |     360742 |     467566 
    353 |     236899 |     308581 |     353464 |     466936 
    397 |     249036 |     305042 |     344673 |     466842 

I skipped v10 since I used it internally for the variant
"insert entry with dummy index, then search for victim".


------

regards

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments

Re: BufferAlloc: don't take two simultaneous locks

From
Kyotaro Horiguchi
Date:
Hi, Yura.

At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> Ok, I got access to stronger server, did the benchmark, found weird
> things, and so here is new version :-)

Thanks for the new version and benchmarking.

> First I found if table size is strictly limited to NBuffers and FIXED,
> then under high concurrency get_hash_entry may not find free entry
> despite it must be there. It seems while process scans free lists, other
> concurrent processes "moves etry around", ie one concurrent process
> fetched it from one free list, other process put new entry in other
> freelist, and unfortunate process missed it since it tests freelists
> only once.

StrategyGetBuffer believes that entries don't move across freelists
and it was true before this patch.

> Second, I confirm there is problem with freelist spreading.
> If I keep entry's freelist_idx, then one freelist is crowded.
> If I use new entry's freelist_idx, then one freelist is emptified
> constantly.

Perhaps it is what I saw before.  I'm not sure about the details of
how that happens, though.

> Third, I found increased concurrency could harm. When popular block
> is evicted for some reason, then thundering herd effect occures:
> many backends wants to read same block, they evict many other
> buffers, but only one is inserted. Other goes to freelist. Evicted
> buffers by itself reduce cache hit ratio and provocates more
> work. Old version resists this effect by not removing old buffer
> before new entry is successfully inserted.

Nice finding.

> To fix this issues I made following:
> 
> # Concurrency
> 
> First, I limit concurrency by introducing other lwlocks tranche -
> BufferEvict. It is 8 times larger than BufferMapping tranche (1024 vs
> 128).
> If backend doesn't find buffer in buffer table and wants to introduce
> it, it first calls
>     LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> If lock were acquired, then it goes to eviction and replace process.
> Otherwise, it waits lock to be released and repeats search.
>
> This greately improve performance for > 400 clients in pgbench.

So the performance difference between the existing code and v11 is that the
latter has a collision cross-section eight times smaller than the
former?

+     * Prevent "thundering herd" problem and limit concurrency.

This is something like pressing the accelerator and the brake pedals at the
same time.  If it improves performance, wouldn't just increasing the number
of buffer partitions work?

It's also not great that follower backends run a busy loop on the
lock until the top-runner backend inserts the new buffer into the
buftable and then releases the newPartitionLock.

> I tried other variant as well:
> - first insert entry with dummy buffer index into buffer table.
> - if such entry were already here, then wait it to be filled.
> - otherwise find victim buffer and replace dummy index with new one.
> Wait were done with shared lock on EvictPartitionLock as well.
> This variant performed quite same.

This one looks better to me. Since a partition can be shared by two or
more new buffers, a condition variable seems to work better here...

> Logically I like that variant more, but there is one gotcha: 
> FlushBuffer could fail with elog(ERROR). Therefore then there is
> a need to reliable remove entry with dummy index.

Perhaps UnlockBuffers can do that.

> And after all, I still need to hold EvictPartitionLock to notice
> waiters.
> I've tried to use ConditionalVariable, but its performance were much
> worse.

How many CVs did you use?

> # Dynahash capacity and freelists.
> 
> I returned back buffer table initialization:
> - removed FIXES_SIZE restriction introduced in previous version

Mmm. I don't see v10 in this list and v9 doesn't contain FIXES_SIZE..

> - returned `NBuffers + NUM_BUFFER_PARTITIONS`.
> I really think there should be more spare items, since almost always
> entry_alloc is called at least once (on 128MB shared_buffers). But
> let's keep it as is for now.

Maybe s/entry_alloc/element_alloc/ ? :p

I see it with shared_buffers=128kB (not MB) and pgbench -i on master.

The required number of elements is already allocated to the freelists at
hash creation. So the reason for the call is imbalanced use among
freelists.  Even in that case other freelists hold elements.  So we
don't need to expand the number of elements.

> `get_hash_entry` was changed to probe NUM_FREELISTS/4 (==8) freelists
> before falling back to `entry_alloc`, and probing was changed from
> linear to quadratic. This greatly reduces the number of calls to
> `entry_alloc`, so more shared memory is left intact. And I didn't notice
> a large performance hit from it. Probably there is some, but I think it is
> an adequate trade-off.

I don't think that causes a significant performance hit, but I don't
understand how it improves the freelist hit ratio other than by accident.
Do you have some reasoning for it?

By the way, the change to get_hash_entry looks somewhat wrong.

If I understand it correctly, it visits num_freelists/4 freelists at
once, then tries element_alloc. If element_alloc() fails (that must
happen), it only tries freeList[freelist_idx] and gives up, even
though there must be an element in the other 3/4 of the freelists.

> `free_reused_entry` now returns the entry to a random position. That flattens
> the spread of free entries, although it is not enough without the other changes
> (thundering herd mitigation and probing more lists in get_hash_entry).

If "thundering herd" means "many backends rush trying to read in the
same page at once", isn't it avoided by the change in BufferAlloc?

I feel the random returning method might work. I want to get rid of
the randomness here but I can't come up with a better way.

Anyway, the code path is used only by the buftable, so it does no harm
in general.

> # Benchmarks

# Thanks for benchmarking!!

> Benchmarked on two socket Xeon(R) Gold 5220 CPU @2.20GHz
> 18 cores per socket + hyper-threading - up to 72 virtual cores total.
> turbo-boost disabled
> Linux 5.10.103-1 Debian.
> 
> pgbench scale 100 simple_select + simple select with 3 keys (sql file
> attached).
> 
> shared buffers 128MB & 1GB
> huge_pages=on
> 
> 1 socket
>   conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
> --------+------------+------------+------------+------------
>       1 |      27882 |      27738 |      32735 |      32439 
>       2 |      54082 |      54336 |      64387 |      63846 
>       3 |      80724 |      81079 |      96387 |      94439 
>       5 |     134404 |     133429 |     160085 |     157399 
>       7 |     185977 |     184502 |     219916 |     217142 

v11+128MB degrades above here..

>      17 |     335345 |     338214 |     393112 |     388796 
>      27 |     393686 |     394948 |     447945 |     444915 
>      53 |     572234 |     577092 |     678884 |     676493 
>      83 |     558875 |     561689 |     669212 |     655697 
>     107 |     553054 |     551896 |     654550 |     646010 
>     139 |     541263 |     538354 |     641937 |     633840 
>     163 |     532932 |     531829 |     635127 |     627600 
>     191 |     524647 |     524442 |     626228 |     617347 
>     211 |     521624 |     522197 |     629740 |     613143 

v11+1GB degrades above here..

>     239 |     509448 |     554894 |     652353 |     652972 
>     271 |     468190 |     557467 |     647403 |     661348 
>     307 |     454139 |     558694 |     642229 |     657649 
>     353 |     446853 |     554301 |     635991 |     654571 
>     397 |     441909 |     549822 |     625194 |     647973 
> 
> 1 socket 3 keys
> 
>   conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
> --------+------------+------------+------------+------------
>       1 |      16677 |      16477 |      22219 |      22030 
>       2 |      32056 |      31874 |      43298 |      43153 
>       3 |      48091 |      47766 |      64877 |      64600 
>       5 |      78999 |      78609 |     105433 |     106101 
>       7 |     108122 |     107529 |     148713 |     145343

v11+128MB degrades above here..

>      17 |     205656 |     209010 |     272676 |     271449 
>      27 |     252015 |     254000 |     323983 |     323499 

v11+1GB degrades above here..

>      53 |     317928 |     334493 |     446740 |     449641 
>      83 |     299234 |     327738 |     437035 |     443113 
>     107 |     290089 |     322025 |     430535 |     431530 
>     139 |     277294 |     314384 |     422076 |     423606 
>     163 |     269029 |     310114 |     416229 |     417412 
>     191 |     257315 |     306530 |     408487 |     416170 
>     211 |     249743 |     304278 |     404766 |     416393 
>     239 |     243333 |     310974 |     397139 |     428167 
>     271 |     236356 |     309215 |     389972 |     427498 
>     307 |     229094 |     307519 |     382444 |     425891 
>     353 |     224385 |     305366 |     375020 |     423284 
>     397 |     218549 |     302577 |     364373 |     420846 
> 
> 2 sockets
> 
>   conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
> --------+------------+------------+------------+------------
>       1 |      27287 |      27631 |      32943 |      32493 
>       2 |      52397 |      54011 |      64572 |      63596 
>       3 |      76157 |      80473 |      93363 |      93528 
>       5 |     127075 |     134310 |     153176 |     149984 
>       7 |     177100 |     176939 |     216356 |     211599 
>      17 |     379047 |     383179 |     464249 |     470351 
>      27 |     545219 |     546706 |     664779 |     662488 
>      53 |     728142 |     728123 |     857454 |     869407 
>      83 |     918276 |     957722 |    1215252 |    1203443 

v11+1GB degrades above here..

>     107 |     884112 |     971797 |    1206930 |    1234606 
>     139 |     822564 |     970920 |    1167518 |    1233230 
>     163 |     788287 |     968248 |    1130021 |    1229250 
>     191 |     772406 |     959344 |    1097842 |    1218541 
>     211 |     756085 |     955563 |    1077747 |    1209489 
>     239 |     732926 |     948855 |    1050096 |    1200878 
>     271 |     692999 |     941722 |    1017489 |    1194012 
>     307 |     668241 |     920478 |     994420 |    1179507 
>     353 |     642478 |     908645 |     968648 |    1174265 
>     397 |     617673 |     893568 |     950736 |    1173411 
> 
> 2 sockets 3 keys
> 
>   conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
> --------+------------+------------+------------+------------
>       1 |      16722 |      16393 |      20340 |      21813 
>       2 |      32057 |      32009 |      39993 |      42959 
>       3 |      46202 |      47678 |      59216 |      64374 
>       5 |      78882 |      72002 |      98054 |     103731 
>       7 |     103398 |      99538 |     135098 |     135828 

v11+128MB degrades above here..

>      17 |     205863 |     217781 |     293958 |     299690 
>      27 |     283526 |     290539 |     414968 |     411219 
>      53 |     336717 |     356130 |     460596 |     474563 
>      83 |     307310 |     342125 |     419941 |     469989 
>     107 |     294059 |     333494 |     405706 |     469593 
>     139 |     278453 |     328031 |     390984 |     470553 
>     163 |     270833 |     326457 |     384747 |     470977 
>     191 |     259591 |     322590 |     376582 |     470335 
>     211 |     263584 |     321263 |     375969 |     469443 
>     239 |     257135 |     316959 |     370108 |     470904 
>     271 |     251107 |     315393 |     365794 |     469517 
>     307 |     246605 |     311585 |     360742 |     467566 
>     353 |     236899 |     308581 |     353464 |     466936 
>     397 |     249036 |     305042 |     344673 |     466842 
> 
> I skipped v10 since I used it internally for variant
> "insert entry with dummy index then search victim".

Up to about 15%(?) gain is great.
I'm not sure it is okay that it seems to slow down by about 1%..


Ah, I see.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:
> Hi, Yura.
> 
> At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > Ok, I got access to stronger server, did the benchmark, found weird
> > things, and so here is new version :-)
> 
> Thanks for the new version and benchmarking.
> 
> > First I found that if table size is strictly limited to NBuffers and FIXED,
> > then under high concurrency get_hash_entry may not find a free entry
> > even though it must be there. It seems that while a process scans the free
> > lists, other concurrent processes "move entries around", i.e. one concurrent
> > process fetches an entry from one free list, another process puts a new entry
> > into another freelist, and the unfortunate process misses it since it tests
> > the freelists only once.
> 
> StrategyGetBuffer believes that entries don't move across freelists
> and it was true before this patch.

StrategyGetBuffer knows nothing about dynahash's freelists.
It knows about the buffer manager's freelist, which is not partitioned.

> 
> > Second, I confirm there is a problem with freelist spreading.
> > If I keep the entry's freelist_idx, then one freelist is crowded.
> > If I use the new entry's freelist_idx, then one freelist is emptied
> > constantly.
> 
> Perhaps it is what I saw before.  I'm not sure about the details of
> how that happens, though.
> 
> > Third, I found that increased concurrency could harm. When a popular block
> > is evicted for some reason, a thundering herd effect occurs:
> > many backends want to read the same block, they evict many other
> > buffers, but only one is inserted. The others go to the freelist. The evicted
> > buffers by themselves reduce the cache hit ratio and provoke more
> > work. The old version resists this effect by not removing the old buffer
> > before the new entry is successfully inserted.
> 
> Nice finding.
> 
> > To fix these issues I made the following changes:
> > 
> > # Concurrency
> > 
> > First, I limit concurrency by introducing another lwlock tranche -
> > BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
> > 128).
> > If a backend doesn't find a buffer in the buffer table and wants to introduce
> > it, it first calls
> >     LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> > If the lock was acquired, then it proceeds to the eviction and replacement
> > process.
> > Otherwise, it waits for the lock to be released and repeats the search.
> >
> > This greatly improves performance for > 400 clients in pgbench.
> 
> So the performance difference between the existing code and v11 is that the
> latter has a collision cross section eight times smaller than the
> former?

No. Acquiring EvictPartitionLock
1. doesn't block readers, since readers don't acquire EvictPartitionLock
2. doesn't form a "tree of lock dependencies", since EvictPartitionLock is
  independent of PartitionLock.

The problem with the existing code:
1. Process A locks P1 and P2
2. Process B (p3-old, p1-new) locks P3 and wants to lock P1
3. Process C (p4-new, p1-old) locks P4 and wants to lock P1
4. Process D (p5-new, p4-old) locks P5 and wants to lock P4
At this moment partitions P1, P2, P3, P4 and P5 are all locked and waiting
for Process A.
And readers can't read from those five partitions.

With the new code:
1. Process A locks E1 (evict partition) and locks P2,
   then releases P2 and locks P1.
2. Process B tries to lock E1, waits, and retries the search.
3. Process C locks E4, locks P1, then releases P1 and locks P4
4. Process D locks E5, locks P4, then releases P4 and locks P5
So, there is no network of locks.
Process A doesn't block Process D at any moment:
- either A blocks C, but C doesn't block D at this moment,
- or A doesn't block C.
And readers don't see five simultaneously locked partitions that all
depend on a single Process A.
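
To make the ordering concrete, here is a rough sketch of the v11 flow (not
the literal patch hunk; BufEvictPartitionLock is only shorthand here for
picking a lock from the new BufferEvict tranche, and pinning, flushing and
buffer-header locking are omitted):

    newEvictPartitionLock = BufEvictPartitionLock(newHash);
    if (!LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE))
        goto retry_lookup;          /* someone else brings the page in */

    /* victim already chosen; detach it while holding only the old lock */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    BufTableDelete(&oldTag, oldHash);
    LWLockRelease(oldPartitionLock);

    /* only now take the new partition's lock and insert the new tag */
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    existing_id = BufTableInsert(&newTag, newHash, buf->buf_id);
    LWLockRelease(newPartitionLock);
    LWLockRelease(newEvictPartitionLock);

    if (existing_id >= 0)
        StrategyFreeBuffer(buf);    /* lost the race: buffer goes to freelist */

At no point are two mapping partition locks held at once, so the dependency
network described above cannot form.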

> +        * Prevent "thundering herd" problem and limit concurrency.
> 
> this is something like pressing the accelerator and brake pedals at the
> same time.  If it improves performance, just increasing the number of
> buffer partitions seems to work?

To be honest: of course a simple increase of NUM_BUFFER_PARTITIONS
does improve the average case.
But it is better to cure the problem than to anesthetize it.
Increasing NUM_BUFFER_PARTITIONS reduces the probability and relative
weight of the lock network, but doesn't eliminate it.

> It's also not great that follower backends run a busy loop on the
> lock until the front-runner backend inserts the new buffer into the
> buftable and then releases the newPartitionLock.
> 
> > I tried another variant as well:
> > - first insert an entry with a dummy buffer index into the buffer table.
> > - if such an entry was already there, then wait for it to be filled.
> > - otherwise find a victim buffer and replace the dummy index with the new one.
> > Waiting was done with a shared lock on EvictPartitionLock as well.
> > This variant performed quite the same.
> 
> This one looks better to me. Since a partition can be shared by two or
> more new-buffers, condition variable seems to work better here...
> 
> > Logically I like that variant more, but there is one gotcha:
> > FlushBuffer could fail with elog(ERROR). Therefore there is
> > a need to reliably remove the entry with the dummy index.
> 
> Perhaps UnlockBuffers can do that.

Thanks for the suggestion. I'll try to investigate and retry this approach
to the patch.

> > And after all, I still need to hold EvictPartitionLock to notice
> > waiters.
> > I've tried to use ConditionVariable, but its performance was much
> > worse.
> 
> How many CVs did you use?

I've tried both NUM_PARTITION_LOCKS and NUM_PARTITION_LOCKS*8.
It doesn't matter.
It looks like the use of WaitLatch (which uses epoll) and/or a triple
SpinLockAcquire per good case (with two list traversals) is much worse
than PGSemaphoreLock (which uses a futex) and a single wait-list action.

Another possibility is that while ConditionVariable eliminates the thundering
herd effect, it doesn't limit concurrency enough... but that's just
theory.

In reality, I'd like to try to make BufferLookupEnt->id atomic
and add an LWLock to BufferLookupEnt. I'll test it, but I doubt it could
be merged, since there is no way to initialize dynahash's entries
reliably.

> > # Dynahash capacity and freelists.
> > 
> > I returned back buffer table initialization:
> > - removed FIXES_SIZE restriction introduced in previous version
> 
> Mmm. I don't see v10 in this list and v9 doesn't contain FIXES_SIZE..

v9 contains HASH_FIXED_SIZE - line 815 of patch, PATCH 3/4 "fixed BufTable".

> > - returned `NBuffers + NUM_BUFFER_PARTITIONS`.
> > I really think there should be more spare items, since almost always
> > entry_alloc is called at least once (on 128MB shared_buffers). But
> > let's keep it as is for now.
> 
> Maybe s/entry_alloc/element_alloc/ ? :p

:p yes

> I see it with shared_buffers=128kB (not MB) and pgbench -i on master.
> 
> The required number of elements is already allocated to the freelists at
> hash creation. So the reason for the call is imbalanced use among
> freelists.  Even in that case other freelists hold elements.  So we
> don't need to expand the number of elements.
> 
> > `get_hash_entry` was changed to probe NUM_FREELISTS/4 (==8) freelists
> > before falling back to `entry_alloc`, and probing was changed from
> > linear to quadratic. This greatly reduces the number of calls to
> > `entry_alloc`, so more shared memory is left intact. And I didn't notice
> > a large performance hit from it. Probably there is some, but I think it is
> > an adequate trade-off.
> 
> I don't think that causes a significant performance hit, but I don't
> understand how it improves the freelist hit ratio other than by accident.
> Do you have some reasoning for it?

Since free_reused_entry returns the entry into a random free list, this
probability is quite high. In tests, I see stabilisa

> By the way, the change to get_hash_entry looks somewhat wrong.
> 
> If I understand it correctly, it visits num_freelists/4 freelists at
> once, then tries element_alloc. If element_alloc() fails (that must
> happen), it only tries freeList[freelist_idx] and gives up, even
> though there must be an element in the other 3/4 of the freelists.

No. If element_alloc fails, it tries all NUM_FREELISTS again.
- condition: `ntries || !allocFailed`. `allocFailed` becomes true,
  so only `ntries` keeps the loop going.
- `ntries = num_freelists;` is set regardless of `allocFailed`.
Therefore, all `NUM_FREELISTS` are retried for a partitioned table.
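
To make that control flow easier to follow, here is a rough sketch (not the
patch text; pop_from_freelist() is just shorthand for the real free-list
manipulation under the freelist spinlock):

    ntries = num_freelists / 4 + 1;
    allocFailed = false;
    for (;;)
    {
        for (i = 0; i < ntries; i++)
        {
            /* quadratic step instead of the old linear scan */
            idx = (freelist_idx + i * i) % num_freelists;
            if ((entry = pop_from_freelist(hashp, idx)) != NULL)
                return entry;
        }
        if (allocFailed)
            return NULL;            /* allocation already failed once */
        /* allocate a fresh chunk of elements, then rescan every freelist */
        allocFailed = !element_alloc(hashp, hctl->nelem_alloc, freelist_idx);
        ntries = num_freelists;
    }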

> 
> > `free_reused_entry` now returns the entry to a random position. That flattens
> > the spread of free entries, although it is not enough without the other changes
> > (thundering herd mitigation and probing more lists in get_hash_entry).
> 
> If "thundering herd" means "many backends rush trying to read in the
> same page at once", isn't it avoided by the change in BufferAlloc?

"Thundering herd" reduces the speed of entry migration a lot. But the
`simple_select` benchmark is too biased: it looks like the btree root is
evicted from time to time. So entries slowly migrate to or from the
freelist of their partition.
Without the "thundering herd" fix this migration is very fast.

> I feel the random returning method might work. I want to get rid of
> the randomness here but I can't come up with a better way.
> 
> Anyway, the code path is used only by the buftable, so it does no harm
> in general.
> 
> > # Benchmarks
> 
> # Thanks for benchmarking!!
> 
> > Benchmarked on two socket Xeon(R) Gold 5220 CPU @2.20GHz
> > 18 cores per socket + hyper-threading - up to 72 virtual cores total.
> > turbo-boost disabled
> > Linux 5.10.103-1 Debian.
> > 
> > pgbench scale 100 simple_select + simple select with 3 keys (sql file
> > attached).
> > 
> > shared buffers 128MB & 1GB
> > huge_pages=on
> > 
> > 1 socket
> >   conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
> > --------+------------+------------+------------+------------
> >       1 |      27882 |      27738 |      32735 |      32439 
> >       2 |      54082 |      54336 |      64387 |      63846 
> >       3 |      80724 |      81079 |      96387 |      94439 
> >       5 |     134404 |     133429 |     160085 |     157399 
> >       7 |     185977 |     184502 |     219916 |     217142 
> 
> v11+128MB degrades above here..

+ 1GB?

> 
> >      17 |     335345 |     338214 |     393112 |     388796 
> >      27 |     393686 |     394948 |     447945 |     444915 
> >      53 |     572234 |     577092 |     678884 |     676493 
> >      83 |     558875 |     561689 |     669212 |     655697 
> >     107 |     553054 |     551896 |     654550 |     646010 
> >     139 |     541263 |     538354 |     641937 |     633840 
> >     163 |     532932 |     531829 |     635127 |     627600 
> >     191 |     524647 |     524442 |     626228 |     617347 
> >     211 |     521624 |     522197 |     629740 |     613143 
> 
> v11+1GB degrades above here..
> 
> >     239 |     509448 |     554894 |     652353 |     652972 
> >     271 |     468190 |     557467 |     647403 |     661348 
> >     307 |     454139 |     558694 |     642229 |     657649 
> >     353 |     446853 |     554301 |     635991 |     654571 
> >     397 |     441909 |     549822 |     625194 |     647973 
> > 
> > 1 socket 3 keys
> > 
> >   conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
> > --------+------------+------------+------------+------------
> >       1 |      16677 |      16477 |      22219 |      22030 
> >       2 |      32056 |      31874 |      43298 |      43153 
> >       3 |      48091 |      47766 |      64877 |      64600 
> >       5 |      78999 |      78609 |     105433 |     106101 
> >       7 |     108122 |     107529 |     148713 |     145343 
> 
> v11+128MB degrades above here..
> 
> >      17 |     205656 |     209010 |     272676 |     271449 
> >      27 |     252015 |     254000 |     323983 |     323499 
> 
> v11+1GB degrades above here..
> 
> >      53 |     317928 |     334493 |     446740 |     449641 
> >      83 |     299234 |     327738 |     437035 |     443113 
> >     107 |     290089 |     322025 |     430535 |     431530 
> >     139 |     277294 |     314384 |     422076 |     423606 
> >     163 |     269029 |     310114 |     416229 |     417412 
> >     191 |     257315 |     306530 |     408487 |     416170 
> >     211 |     249743 |     304278 |     404766 |     416393 
> >     239 |     243333 |     310974 |     397139 |     428167 
> >     271 |     236356 |     309215 |     389972 |     427498 
> >     307 |     229094 |     307519 |     382444 |     425891 
> >     353 |     224385 |     305366 |     375020 |     423284 
> >     397 |     218549 |     302577 |     364373 |     420846 
> > 
> > 2 sockets
> > 
> >   conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
> > --------+------------+------------+------------+------------
> >       1 |      27287 |      27631 |      32943 |      32493 
> >       2 |      52397 |      54011 |      64572 |      63596 
> >       3 |      76157 |      80473 |      93363 |      93528 
> >       5 |     127075 |     134310 |     153176 |     149984 
> >       7 |     177100 |     176939 |     216356 |     211599 
> >      17 |     379047 |     383179 |     464249 |     470351 
> >      27 |     545219 |     546706 |     664779 |     662488 
> >      53 |     728142 |     728123 |     857454 |     869407 
> >      83 |     918276 |     957722 |    1215252 |    1203443 
> 
> v11+1GB degrades above here..
> 
> >     107 |     884112 |     971797 |    1206930 |    1234606 
> >     139 |     822564 |     970920 |    1167518 |    1233230 
> >     163 |     788287 |     968248 |    1130021 |    1229250 
> >     191 |     772406 |     959344 |    1097842 |    1218541 
> >     211 |     756085 |     955563 |    1077747 |    1209489 
> >     239 |     732926 |     948855 |    1050096 |    1200878 
> >     271 |     692999 |     941722 |    1017489 |    1194012 
> >     307 |     668241 |     920478 |     994420 |    1179507 
> >     353 |     642478 |     908645 |     968648 |    1174265 
> >     397 |     617673 |     893568 |     950736 |    1173411 
> > 
> > 2 sockets 3 keys
> > 
> >   conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
> > --------+------------+------------+------------+------------
> >       1 |      16722 |      16393 |      20340 |      21813 
> >       2 |      32057 |      32009 |      39993 |      42959 
> >       3 |      46202 |      47678 |      59216 |      64374 
> >       5 |      78882 |      72002 |      98054 |     103731 
> >       7 |     103398 |      99538 |     135098 |     135828 
> 
> v11+128MB degrades above here..
> 
> >      17 |     205863 |     217781 |     293958 |     299690 
> >      27 |     283526 |     290539 |     414968 |     411219 
> >      53 |     336717 |     356130 |     460596 |     474563 
> >      83 |     307310 |     342125 |     419941 |     469989 
> >     107 |     294059 |     333494 |     405706 |     469593 
> >     139 |     278453 |     328031 |     390984 |     470553 
> >     163 |     270833 |     326457 |     384747 |     470977 
> >     191 |     259591 |     322590 |     376582 |     470335 
> >     211 |     263584 |     321263 |     375969 |     469443 
> >     239 |     257135 |     316959 |     370108 |     470904 
> >     271 |     251107 |     315393 |     365794 |     469517 
> >     307 |     246605 |     311585 |     360742 |     467566 
> >     353 |     236899 |     308581 |     353464 |     466936 
> >     397 |     249036 |     305042 |     344673 |     466842 
> > 
> > I skipped v10 since I used it internally for variant
> > "insert entry with dummy index then search victim".
> 
> Up to about 15%(?) gain is great.

Up to 35% in "2 socket 3 key 1GB" case.
Up to 44% in "2 socket 1 key 128MB" case. 

> I'm not sure it is okay that it seems to slow down by about 1%..

Well, in fact some degradation is not reproducible.
Surprisingly, results change a bit from time to time.
I just didn't rerun the whole `master` branch benchmark again
after the v11 benchmark, since each whole test run costs me 1.5 hours.

But I confirm a regression on the "1 socket 1 key 1GB" test case
between 83 and 211 connections. It was reproducible on a
more powerful Xeon 8354H, although it was less visible.

Other fluctuations close to 1% are not reliable.
For example, sometimes I see degradation or improvement with
2GB shared buffers (and even more than 1%). But 2GB is enough
for the whole test dataset (scale 100 pgbench is 1.5GB on disk).
Therefore the modified code is not involved in benchmarking at all.
How could that be explained?
That is why I don't post 2GB benchmark results. (Yeah, I'm
cheating a bit.)

> Ah, I see.
> 
> regards.
> 
> -- 
> Kyotaro Horiguchi
> NTT Open Source Software Center
> 
> 




Re: BufferAlloc: don't take two simultaneous locks

От
Kyotaro Horiguchi
Дата:
At Thu, 07 Apr 2022 14:14:59 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:
> > Hi, Yura.
> >
> > At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > > Ok, I got access to stronger server, did the benchmark, found weird
> > > things, and so here is new version :-)
> >
> > Thanks for the new version and benchmarking.
> >
> > > First I found that if table size is strictly limited to NBuffers and FIXED,
> > > then under high concurrency get_hash_entry may not find a free entry
> > > even though it must be there. It seems that while a process scans the free
> > > lists, other concurrent processes "move entries around", i.e. one concurrent
> > > process fetches an entry from one free list, another process puts a new entry
> > > into another freelist, and the unfortunate process misses it since it tests
> > > the freelists only once.
> >
> > StrategyGetBuffer believes that entries don't move across freelists
> > and it was true before this patch.
>
> StrategyGetBuffer knows nothing about dynahash's freelists.
> It knows about the buffer manager's freelist, which is not partitioned.

Yeah, right. I meant get_hash_entry.

> > > To fix these issues I made the following changes:
> > >
> > > # Concurrency
> > >
> > > First, I limit concurrency by introducing another lwlock tranche -
> > > BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
> > > 128).
> > > If a backend doesn't find a buffer in the buffer table and wants to introduce
> > > it, it first calls
> > >     LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> > > If the lock was acquired, then it proceeds to the eviction and replacement
> > > process.
> > > Otherwise, it waits for the lock to be released and repeats the search.
> > >
> > > This greatly improves performance for > 400 clients in pgbench.
> >
> > So the performance difference between the existing code and v11 is that the
> > latter has a collision cross section eight times smaller than the
> > former?
>
> No. Acquiring EvictPartitionLock
> 1. doesn't block readers, since readers don't acquire EvictPartitionLock
> 2. doesn't form a "tree of lock dependencies", since EvictPartitionLock is
>   independent of PartitionLock.
>
> The problem with the existing code:
> 1. Process A locks P1 and P2
> 2. Process B (p3-old, p1-new) locks P3 and wants to lock P1
> 3. Process C (p4-new, p1-old) locks P4 and wants to lock P1
> 4. Process D (p5-new, p4-old) locks P5 and wants to lock P4
> At this moment partitions P1, P2, P3, P4 and P5 are all locked and waiting
> for Process A.
> And readers can't read from those five partitions.
>
> With the new code:
> 1. Process A locks E1 (evict partition) and locks P2,
>    then releases P2 and locks P1.
> 2. Process B tries to lock E1, waits, and retries the search.
> 3. Process C locks E4, locks P1, then releases P1 and locks P4
> 4. Process D locks E5, locks P4, then releases P4 and locks P5
> So, there is no network of locks.
> Process A doesn't block Process D at any moment:
> - either A blocks C, but C doesn't block D at this moment,
> - or A doesn't block C.
> And readers don't see five simultaneously locked partitions that all
> depend on a single Process A.

Thanks for the detailed explanation. I see that.

> > +        * Prevent "thundering herd" problem and limit concurrency.
> >
> > this is something like pressing the accelerator and brake pedals at the
> > same time.  If it improves performance, just increasing the number of
> > buffer partitions seems to work?
>
> To be honest: of course a simple increase of NUM_BUFFER_PARTITIONS
> does improve the average case.
> But it is better to cure the problem than to anesthetize it.
> Increasing NUM_BUFFER_PARTITIONS reduces the probability and relative
> weight of the lock network, but doesn't eliminate it.

Agreed.

> > It's also not great that follower backends run a busy loop on the
> > lock until the front-runner backend inserts the new buffer into the
> > buftable and then releases the newPartitionLock.
> >
> > > I tried another variant as well:
> > > - first insert an entry with a dummy buffer index into the buffer table.
> > > - if such an entry was already there, then wait for it to be filled.
> > > - otherwise find a victim buffer and replace the dummy index with the new one.
> > > Waiting was done with a shared lock on EvictPartitionLock as well.
> > > This variant performed quite the same.
> >
> > This one looks better to me. Since a partition can be shared by two or
> > more new-buffers, condition variable seems to work better here...
> >
> > > Logically I like that variant more, but there is one gotcha:
> > > FlushBuffer could fail with elog(ERROR). Therefore there is
> > > a need to reliably remove the entry with the dummy index.
> >
> > Perhaps UnlockBuffers can do that.
>
> Thanks for the suggestion. I'll try to investigate and retry this approach
> to the patch.
>
> > > And after all, I still need to hold EvictPartitionLock to notice
> > > waiters.
> > > I've tried to use ConditionVariable, but its performance was much
> > > worse.
> >
> > How many CVs did you use?
>
> I've tried both NUM_PARTITION_LOCKS and NUM_PARTITION_LOCKS*8.
> It doesn't matter.
> It looks like the use of WaitLatch (which uses epoll) and/or a triple
> SpinLockAcquire per good case (with two list traversals) is much worse
> than PGSemaphoreLock (which uses a futex) and a single wait-list action.

Sure.  I unintentionally neglected the overhead of our CV
implementation. It cannot be used in such a hot path.

> Another possibility is that while ConditionVariable eliminates the thundering
> herd effect, it doesn't limit concurrency enough... but that's just
> theory.
>
> In reality, I'd like to try to make BufferLookupEnt->id atomic
> and add an LWLock to BufferLookupEnt. I'll test it, but I doubt it could
> be merged, since there is no way to initialize dynahash's entries
> reliably.

Yeah, that's what came to my mind first (though with a CV, not an LWLock),
but I gave up because of the additional size.  The size of
BufferLookupEnt is 24 and sizeof(ConditionVariable) is 12.  By the way
sizeof(LWLock) is 16..  So I think we shouldn't take the per-bufentry
approach here because of the additional memory usage.

> > I don't think that causes a significant performance hit, but I don't
> > understand how it improves the freelist hit ratio other than by accident.
> > Do you have some reasoning for it?
>
> Since free_reused_entry returns the entry into a random free list, this
> probability is quite high. In tests, I see stabilisa

Maybe.  Doesn't it improve the efficiency if we prioritize emptied
freelists when returning an element?  I tried it with an atomic_u32 to
remember empty freelists. On the uint32, each bit represents a freelist
index.  I saw it eliminated calls to element_alloc.  I tried to
remember a single freelist index in an atomic but there was a case
where two freelists were emptied at once and that led to an element_alloc
call.
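
Roughly what I tried looks like this (a sketch only, not the exact code I
tested; the atomic would live in HASHHDR):

    /* one bit per freelist; a bit is set while that freelist is empty */
    pg_atomic_uint32 empty_freelists;       /* added to HASHHDR */

    /* in get_hash_entry: when a freelist runs dry, remember it */
    pg_atomic_fetch_or_u32(&hctl->empty_freelists, (uint32) 1 << idx);

    /* in free_reused_entry: prefer refilling an emptied freelist */
    uint32  mask = pg_atomic_read_u32(&hctl->empty_freelists);
    int     idx  = (mask != 0) ? pg_rightmost_one_pos32(mask) : random_idx;

    pg_atomic_fetch_and_u32(&hctl->empty_freelists, ~((uint32) 1 << idx));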

> > By the way, the change to get_hash_entry looks somewhat wrong.
> > 
> > If I understand it correctly, it visits num_freelists/4 freelists at
> > once, then tries element_alloc. If element_alloc() fails (that must
> > happen), it only tries freeList[freelist_idx] and gives up, even
> > though there must be an element in the other 3/4 of the freelists.
>
> No. If element_alloc fails, it tries all NUM_FREELISTS again.
> - condition: `ntries || !allocFailed`. `allocFailed` becomes true,
>   so only `ntries` keeps the loop going.
> - `ntries = num_freelists;` is set regardless of `allocFailed`.
> Therefore, all `NUM_FREELISTS` are retried for a partitioned table.

Ah, okay. ntries is set to num_freelists after calling element_alloc.
I think we (I?) need more comments.

By the way, why is it num_freelists / 4 + 1?

> > > `free_reused_entry` now returns the entry to a random position. That flattens
> > > the spread of free entries, although it is not enough without the other changes
> > > (thundering herd mitigation and probing more lists in get_hash_entry).
> >
> > If "thundering herd" means "many backends rush trying to read in the
> > same page at once", isn't it avoided by the change in BufferAlloc?
>
> "Thundering herd" reduces the speed of entry migration a lot. But the
> `simple_select` benchmark is too biased: it looks like the btree root is
> evicted from time to time. So entries slowly migrate to or from the
> freelist of their partition.
> Without the "thundering herd" fix this migration is very fast.

Ah, that observation agrees with the seemingly unidirectional migration
of free entries.

I remember that prioritizing index pages in shared buffers has been
raised on this list several times..

> > Up to about 15%(?) gain is great.
>
> Up to 35% in "2 socket 3 key 1GB" case.
> Up to 44% in "2 socket 1 key 128MB" case.

Oh, even better!

> > I'm not sure it is okay that it seems to slow down by about 1%..
>
> Well, in fact some degradation is not reproducible.
> Surprisingly, results change a bit from time to time.

Yeah.

> I just didn't rerun the whole `master` branch benchmark again
> after the v11 benchmark, since each whole test run costs me 1.5 hours.

Thanks for the labor.

> But I confirm a regression on the "1 socket 1 key 1GB" test case
> between 83 and 211 connections. It was reproducible on a
> more powerful Xeon 8354H, although it was less visible.
>
> Other fluctuations close to 1% are not reliable.

I'm glad to hear that.  It is not surprising that some fluctuation
happens.

> For example, sometimes I see degradation or improvement with
> 2GB shared buffers (and even more than 1%). But 2GB is enough
> for the whole test dataset (scale 100 pgbench is 1.5GB on disk).
> Therefore the modified code is not involved in benchmarking at all.
> How could that be explained?
> That is why I don't post 2GB benchmark results. (Yeah, I'm
> cheating a bit.)

If buffer replacement doesn't happen, theoretically this patch cannot
be involved in the fluctuation.  I think we can consider it measurement error.

It might come from the placement of other variables. I have sometimes been
annoyed by a small but steady change of performance that persists
until I recompile the whole tree.  But, sorry, I don't have a clear
idea of how such a performance shift happens..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
On Fri, 08/04/2022 at 16:46 +0900, Kyotaro Horiguchi wrote:
> At Thu, 07 Apr 2022 14:14:59 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:
> > > Hi, Yura.
> > > 
> > > At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > > Ok, I got access to stronger server, did the benchmark, found weird
> > > > things, and so here is new version :-)
> > > 
> > > Thanks for the new version and benchmarking.
> > > 
> > > > First I found that if table size is strictly limited to NBuffers and FIXED,
> > > > then under high concurrency get_hash_entry may not find a free entry
> > > > even though it must be there. It seems that while a process scans the free
> > > > lists, other concurrent processes "move entries around", i.e. one concurrent
> > > > process fetches an entry from one free list, another process puts a new entry
> > > > into another freelist, and the unfortunate process misses it since it tests
> > > > the freelists only once.
> > > 
> > > StrategyGetBuffer believes that entries don't move across freelists
> > > and it was true before this patch.
> > 
> > StrategyGetBuffer knows nothing about dynahash's freelists.
> > It knows about the buffer manager's freelist, which is not partitioned.
> 
> Yeah, right. I meant get_hash_entry.

But entries don't move.
One backend takes some entry from one freelist, another backend puts
a different entry into another freelist.

> > > I don't think that causes a significant performance hit, but I don't
> > > understand how it improves the freelist hit ratio other than by accident.
> > > Do you have some reasoning for it?
> > 
> > Since free_reused_entry returns the entry into a random free list, this
> > probability is quite high. In tests, I see stabilisa
> 
> Maybe.  Doesn't it improve the efficiency if we prioritize emptied
> freelists when returning an element?  I tried it with an atomic_u32 to
> remember empty freelists. On the uint32, each bit represents a freelist
> index.  I saw it eliminated calls to element_alloc.  I tried to
> remember a single freelist index in an atomic but there was a case
> where two freelists were emptied at once and that led to an element_alloc
> call.

I thought about a bitmask too.
But doesn't it bring back the contention that having many freelists was
meant to avoid?
Well, if there are enough entries to keep it almost always "all
set", it would be effectively immutable.

> > > By the way, the change to get_hash_entry looks somewhat wrong.
> > > 
> > > If I understand it correctly, it visits num_freelists/4 freelists at
> > > once, then tries element_alloc. If element_alloc() fails (that must
> > > happen), it only tries freeList[freelist_idx] and gives up, even
> > > though there must be an element in the other 3/4 of the freelists.
> > 
> > No. If element_alloc fails, it tries all NUM_FREELISTS again.
> > - condition: `ntries || !allocFailed`. `allocFailed` becomes true,
> >   so only `ntries` keeps the loop going.
> > - `ntries = num_freelists;` is set regardless of `allocFailed`.
> > Therefore, all `NUM_FREELISTS` are retried for a partitioned table.
> 
> Ah, okay. ntries is set to num_freelists after calling element_alloc.
> I think we (I?) need more comments.
> 
> By the way, why is it num_freelists / 4 + 1?

Well, num_freelists could be 1 or 32.
If num_freelists is 1 then num_freelists / 4 == 0 - not good :-) 

------

regards

Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

От
Robert Haas
Дата:
On Wed, Apr 6, 2022 at 9:17 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> I skipped v10 since I used it internally for variant
> "insert entry with dummy index then search victim".

Hi,

I think there's a big problem with this patch:

--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
  *
  * Since we can't tolerate running out of lookup table entries, we must be
  * sure to specify an adequate table size here.  The maximum steady-state
- * usage is of course NBuffers entries, but BufferAlloc() tries to insert
- * a new entry before deleting the old.  In principle this could be
- * happening in each partition concurrently, so we could need as many as
- * NBuffers + NUM_BUFFER_PARTITIONS entries.
+ * usage is of course NBuffers entries. But due to concurrent
+ * access to numerous free lists in dynahash we can miss free entry that
+ * moved between free lists. So it is better to have some spare free entries
+ * to reduce probability of entry allocations after server start.
  */
  InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

With the existing system, there is a hard cap on the number of hash
table entries that we can ever need: one per buffer, plus one per
partition to cover the "extra" entries that are needed while changing
buffer tags. With the patch, the number of concurrent buffer tag
changes is no longer limited by NUM_BUFFER_PARTITIONS, because you
release the lock on the old buffer partition before acquiring the lock
on the new partition, and therefore there can be any number of
backends trying to change buffer tags at the same time. But that
means, as the comment implies, that there's no longer a hard cap on
how many hash table entries we might need. I don't think we can just
accept the risk that the hash table might try to allocate after
startup. If it tries, it might fail, because all of the extra shared
memory that we allocate at startup may already have been consumed, and
then somebody's query may randomly error out. That's not OK. It's true
that very few users are likely to be affected, because most people
won't consume the extra shared memory, and of those who do, most won't
hammer the system hard enough to cause an error.

However, I don't see us deciding that it's OK to ship something that
could randomly break just because it won't do so very often.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: BufferAlloc: don't take two simultaneous locks

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> With the existing system, there is a hard cap on the number of hash
> table entries that we can ever need: one per buffer, plus one per
> partition to cover the "extra" entries that are needed while changing
> buffer tags. With the patch, the number of concurrent buffer tag
> changes is no longer limited by NUM_BUFFER_PARTITIONS, because you
> release the lock on the old buffer partition before acquiring the lock
> on the new partition, and therefore there can be any number of
> backends trying to change buffer tags at the same time. But that
> means, as the comment implies, that there's no longer a hard cap on
> how many hash table entries we might need.

I agree that "just hope it doesn't overflow" is unacceptable.
But couldn't you bound the number of extra entries as MaxBackends?
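(That is, size the table for the worst case of one in-flight tag change per
backend; illustratively, something like

    InitBufTable(NBuffers + MaxBackends);

rather than NBuffers + NUM_BUFFER_PARTITIONS.)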

FWIW, I have extremely strong doubts about whether this patch
is safe at all.  This particular problem seems resolvable though.

            regards, tom lane



Re: BufferAlloc: don't take two simultaneous locks

От
Robert Haas
Дата:
On Thu, Apr 14, 2022 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that "just hope it doesn't overflow" is unacceptable.
> But couldn't you bound the number of extra entries as MaxBackends?

Yeah, possibly ... as long as it can't happen that an operation still
counts against the limit after it's failed due to an error or
something like that.

> FWIW, I have extremely strong doubts about whether this patch
> is safe at all.  This particular problem seems resolvable though.

Can you be any more specific?

This existing comment is surely in the running for terrible comment of the year:

         * To change the association of a valid buffer, we'll need to have
         * exclusive lock on both the old and new mapping partitions.

Anybody with a little bit of C knowledge will have no difficulty
gleaning from the code which follows that we are in fact acquiring
both buffer locks, but whoever wrote this (and I think it was a very
long time ago) did not feel it necessary to explain WHY we will need
to have an exclusive lock on both the old and new mapping partitions,
or more specifically, why we must hold both of those locks
simultaneously. That's unfortunate. It is clear that we need to hold
both locks at some point, just because the hash table is partitioned,
but it is not clear why we need to hold them both simultaneously.

It seems to me that whatever hazards exist must come from the fact
that the operation is no longer fully atomic. The existing code
acquires every relevant lock, then does the work, then releases locks.
Ergo, we don't have to worry about concurrency because there basically
can't be any. Stuff could be happening at the same time in other
partitions that are entirely unrelated to what we're doing, but at the
time we touch the two partitions we care about, we're the only one
touching them. Now, if we do as proposed here, we will acquire one
lock, release it, and then take the other lock, and that means that
some operations could overlap that can't overlap today. Whatever gets
broken must get broken because of that possible overlapping, because
in the absence of concurrency, the end state is the same either way.

So ... how could things get broken by having these operations overlap
each other? The possibility that we might run out of buffer mapping
entries is one concern. I guess there's also the question of whether
the collision handling is adequate: if we fail due to a collision and
handle that by putting the buffer on the free list, is that OK? And
what if we fail midway through and the buffer doesn't end up either on
the free list or in the buffer mapping table? I think maybe that's
impossible, but I'm not 100% sure that it's impossible, and I'm not
sure how bad it would be if it did happen. A permanent "leak" of a
buffer that resulted in it becoming permanently unusable would be bad,
for sure. But all of these issues seem relatively possible to avoid
with sufficiently good coding. My intuition is that the buffer mapping
table size limit is the nastiest of the problems, and if that's
resolvable then I'm not sure what else could be a hard blocker. I'm
not saying there isn't anything, just that I don't know what it might
be.

To put all this another way, suppose that we threw out the way we do
buffer allocation today and always allocated from the freelist. If the
freelist is found to be empty, the backend wanting a buffer has to do
some kind of clock sweep to populate the freelist with >=1 buffers,
and then try again. I don't think that would be performant or fair,
because it would probably happen frequently that a buffer some backend
had just added to the free list got stolen by some other backend, but
I think it would be safe, because we already put buffers on the
freelist when relations or databases are dropped, and we allocate from
there just fine in that case. So then why isn't this safe? It's
functionally the same thing, except we (usually) skip over the
intermediate step of putting the buffer on the freelist and taking it
off again.
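
In rough pseudocode, the scheme I'm describing would be something like this
(TryGetBufferFromFreelist and RunClockSweep are made-up names, just to show
the shape of the loop):

    for (;;)
    {
        BufferDesc *buf = TryGetBufferFromFreelist();   /* spinlock-protected pop */
        if (buf != NULL)
            return buf;
        /* freelist is empty: advance the clock hand until at least one
         * evictable buffer has been pushed onto the freelist, then retry */
        RunClockSweep();
    }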

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: BufferAlloc: don't take two simultaneous locks

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 14, 2022 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> FWIW, I have extremely strong doubts about whether this patch
>> is safe at all.  This particular problem seems resolvable though.

> Can you be any more specific?

> This existing comment is surely in the running for terrible comment of the year:

>          * To change the association of a valid buffer, we'll need to have
>          * exclusive lock on both the old and new mapping partitions.

I'm pretty sure that text is mine, and I didn't really think it needed
any additional explanation, because of exactly this:

> It seems to me that whatever hazards exist must come from the fact
> that the operation is no longer fully atomic.

If it's not atomic, then you have to worry about what happens if you
fail partway through, or somebody else changes relevant state while
you aren't holding the lock.  Maybe all those cases can be dealt with,
but it will be significantly more fragile and more complicated (and
therefore slower in isolation) than the current code.  Is the gain in
potential concurrency worth it?  I didn't think so at the time, and
the graphs upthread aren't doing much to convince me otherwise.

            regards, tom lane



Re: BufferAlloc: don't take two simultaneous locks

От
Robert Haas
Дата:
On Thu, Apr 14, 2022 at 11:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If it's not atomic, then you have to worry about what happens if you
> fail partway through, or somebody else changes relevant state while
> you aren't holding the lock.  Maybe all those cases can be dealt with,
> but it will be significantly more fragile and more complicated (and
> therefore slower in isolation) than the current code.  Is the gain in
> potential concurrency worth it?  I didn't think so at the time, and
> the graphs upthread aren't doing much to convince me otherwise.

Those graphs show pretty big improvements. Maybe that's only because
what is being done is not actually safe, but it doesn't look like a
trivial effect.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: BufferAlloc: don't take two simultaneous locks

От
Kyotaro Horiguchi
Дата:
At Thu, 14 Apr 2022 11:02:33 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> It seems to me that whatever hazards exist must come from the fact
> that the operation is no longer fully atomic. The existing code
> acquires every relevant lock, then does the work, then releases locks.
> Ergo, we don't have to worry about concurrency because there basically
> can't be any. Stuff could be happening at the same time in other
> partitions that are entirely unrelated to what we're doing, but at the
> time we touch the two partitions we care about, we're the only one
> touching them. Now, if we do as proposed here, we will acquire one
> lock, release it, and then take the other lock, and that means that
> some operations could overlap that can't overlap today. Whatever gets
> broken must get broken because of that possible overlapping, because
> in the absence of concurrency, the end state is the same either way.
> 
> So ... how could things get broken by having these operations overlap
> each other? The possibility that we might run out of buffer mapping
> entries is one concern. I guess there's also the question of whether
> the collision handling is adequate: if we fail due to a collision and
> handle that by putting the buffer on the free list, is that OK? And
> what if we fail midway through and the buffer doesn't end up either on
> the free list or in the buffer mapping table? I think maybe that's
> impossible, but I'm not 100% sure that it's impossible, and I'm not
> sure how bad it would be if it did happen. A permanent "leak" of a
> buffer that resulted in it becoming permanently unusable would be bad,

The patch removes the buftable entry first; then it is either inserted
again or returned to the freelist.  I don't understand how it can be in both
the buftable and the freelist..  What kind of trouble do you have in mind, for
example?  Even if some underlying function issued an ERROR, the result
wouldn't differ from the current code. (It seems to me only WARNING or
PANIC, from a quick look.)  Maybe to make sure that it works, we need
to make sure the victim buffer is surely isolated. It is described as
follows.

 * We are single pinner, we hold buffer header lock and exclusive
 * partition lock (if tag is valid). It means no other process can inspect
 * it at the moment.
 *
 * But we will release partition lock and buffer header lock. We must be
 * sure other backend will not use this buffer until we reuse it for new
 * tag. Therefore, we clear out the buffer's tag and flags and remove it
 * from buffer table. Also buffer remains pinned to ensure
 * StrategyGetBuffer will not try to reuse the buffer concurrently.


> for sure. But all of these issues seem relatively possible to avoid
> with sufficiently good coding. My intuition is that the buffer mapping
> table size limit is the nastiest of the problems, and if that's

I believe that still no additional entries are required in the buftable.
The reason for the expansion is explained as follows.

At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> First I found that if table size is strictly limited to NBuffers and FIXED,
> then under high concurrency get_hash_entry may not find a free entry
> even though it must be there. It seems that while a process scans the free lists, other

The freelist starvation is caused by the almost single-directional
inter-freelist migration that this patch introduced. So the expansion is not
needed if we neglect the slowdown (I'm not sure how much it is..)
caused by walking through all freelists.  The inter-freelist migration
will stop if we pull out the HASH_REUSE feature from dynahash.

> resolvable then I'm not sure what else could be a hard blocker. I'm
> not saying there isn't anything, just that I don't know what it might
> be.
> 
> To put all this another way, suppose that we threw out the way we do
> buffer allocation today and always allocated from the freelist. If the
> freelist is found to be empty, the backend wanting a buffer has to do
> some kind of clock sweep to populate the freelist with >=1 buffers,
> and then try again. I don't think that would be performant or fair,
> because it would probably happen frequently that a buffer some backend
> had just added to the free list got stolen by some other backend, but
> I think it would be safe, because we already put buffers on the
> freelist when relations or databases are dropped, and we allocate from
> there just fine in that case. So then why isn't this safe? It's
> functionally the same thing, except we (usually) skip over the
> intermediate step of putting the buffer on the freelist and taking it
> off again.

So, does this make progress if someone (maybe Yura?) runs a
benchmark with this method?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

От
Robert Haas
Дата:
On Fri, Apr 15, 2022 at 4:29 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> The patch removes the buftable entry first; then it is either inserted
> again or returned to the freelist.  I don't understand how it can be in both
> the buftable and the freelist..  What kind of trouble do you have in mind, for
> example?

I'm not sure. I'm just thinking about potential dangers. I was more
worried about it ending up in neither place.

> So, does this make progress if someone (maybe Yura?) runs a
> benchmark with this method?

I think we're talking about theoretical concerns about safety here,
and you can't resolve that by benchmarking. Tom or others may have a
different view, but IMHO the issue with this patch isn't that there
are no performance benefits, but that the patch needs to be fully
safe. He and I may disagree on how likely it is that it can be made
safe, but it can be a million times faster and if it's not safe it's
still dead.

Something clearly needs to be done to plug the specific problem that I
mentioned earlier, somehow making it so we never need to grow the hash
table at runtime. If anyone can think of other such hazards those also
need to be fixed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: BufferAlloc: don't take two simultaneous locks

От
Kyotaro Horiguchi
Дата:
At Mon, 18 Apr 2022 09:53:42 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Fri, Apr 15, 2022 at 4:29 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > The patch removes the buftable entry first; then it is either inserted
> > again or returned to the freelist.  I don't understand how it can be in both
> > the buftable and the freelist..  What kind of trouble do you have in mind, for
> > example?
> 
> I'm not sure. I'm just thinking about potential dangers. I was more
> worried about it ending up in neither place.

I think that is more likely to happen.  But I think that can also
happen with the current code if it had exits on the way. And the
patch does not add a new exit.

> > So, does this make progress if someone (maybe Yura?) runs a
> > benchmark with this method?
> 
> I think we're talking about theoretical concerns about safety here,
> and you can't resolve that by benchmarking. Tom or others may have a

Yeah.. I didn't mean that benchmarking resolves the concerns.  I meant
that if benchmarking shows that the safer (or cleaner) way gives
sufficient gain, we can take that direction.

> different view, but IMHO the issue with this patch isn't that there
> are no performance benefits, but that the patch needs to be fully
> safe. He and I may disagree on how likely it is that it can be made
> safe, but it can be a million times faster and if it's not safe it's
> still dead.

Right.

> Something clearly needs to be done to plug the specific problem that I
> mentioned earlier, somehow making it so we never need to grow the hash
> table at runtime. If anyone can think of other such hazards those also
> need to be fixed.

- Running out of buffer mapping entries?

This seems related to the "runtime growth of the buffer mapping hash
table".  Does runtime growth of the hash mean that get_hash_entry
may call element_alloc even if the hash was created with a sufficient
number of elements?  If so, it's not the fault of this patch.  We could
search all freelists before asking element_alloc() (maybe), at the cost
of some potential temporary degradation.  That being said, I don't
think it's good that we call element_alloc for shared hashes after
creation.
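
To be concrete, the reordering I have in mind would look roughly like the
following inside get_hash_entry() (dynahash.c).  This is only a hand-written
sketch, not a patch; pop() stands for the usual
take-one-element-under-the-freelist-spinlock step and is not a real dynahash
function:

    /* sketch: exhaust every freelist before allocating new elements */
    int i;

    newElement = pop(&hctl->freeList[freelist_idx]);
    if (newElement == NULL && IS_PARTITIONED(hctl))
    {
        for (i = 1; i < NUM_FREELISTS && newElement == NULL; i++)
            newElement = pop(&hctl->freeList[(freelist_idx + i) % NUM_FREELISTS]);
    }
    if (newElement == NULL && !hctl->isfixed &&
        element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
        newElement = pop(&hctl->freeList[freelist_idx]);
    if (newElement == NULL)
        return NULL;            /* caller reports "out of shared memory" */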

- Is it correct collision handling to just return the victimized buffer
  to the freelist?

Potentially the patch can over-victimize buffers, up to
max_connections - 1 of them.  Is this what you are concerned about?  A
way to prevent over-victimization was raised upthread: we insert a
special buffer mapping entry that signals "this page is going to be
available soon" before releasing newPartitionLock.  This prevents
over-victimization.
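
Just to illustrate what I mean, the sketch below is hand-written
pseudocode rather than the patch; BUF_ID_PLACEHOLDER and the waiting
step are made-up names:

    /* sketch: reserve the new tag before releasing newPartitionLock */
    newHash = BufTableHashCode(&newTag);
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    buf_id = BufTableLookup(&newTag, newHash);
    if (buf_id == BUF_ID_PLACEHOLDER)
    {
        /* another backend is already bringing this page in: wait, retry */
        LWLockRelease(newPartitionLock);
    }
    else if (buf_id < 0)
    {
        BufTableInsert(&newTag, newHash, BUF_ID_PLACEHOLDER);
        LWLockRelease(newPartitionLock);
        /* find and clean a victim without holding two partition locks,
         * then re-take newPartitionLock and replace the placeholder
         * with the real buffer id */
    }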

- Doesn't a buffer leak or a duplicate mapping happen?

This patch does not change the order of the required steps, and there
is no exit on the way (if the current code doesn't have one).  No two
processes victimize the same buffer, since the victimizing steps are
protected by oldPartitionLock (and the header lock) the same as in the
current code, and no two processes insert different buffers for the
same page, since the inserting steps are protected by newPartitionLock.
No victimized buffer gets orphaned *if* that doesn't happen with the
current code.  So *I* am at a loss as to how *I* can make it clear that
they don't happen X-(  (Of course Yura might think differently.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
Good day, hackers.

There are some sentences from upthread to respond to.

Sentence one
============

> With the existing system, there is a hard cap on the number of hash
> table entries that we can ever need: one per buffer, plus one per
> partition to cover the "extra" entries that are needed while changing
> buffer tags.

As I understand it: the current shared buffers implementation doesn't
allocate entries after initialization.
(I experimented on master 6c0f9f60f1.)

Ok, then it should be safe to elog(FATAL) if shared buffers need to allocate?
https://pastebin.com/x8arkEdX

(all tests were done on a database initialized with `pgbench -i -s 100`)

  $ pgbench -c 1 -T 10 -P 1 -S -M prepared postgres
  ....
  pgbench: error: client 0 script 0 aborted in command 1 query 0: FATAL:  extend SharedBufHash

oops...

How many entries are allocated after start?
https://pastebin.com/c5z0d5mz
(shared_buffers = 128MB;
 40/80 HT cores on an EPYC 7702 (a VM on 64/128 HT cores))

  $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
  $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

  $1 = 16512

  $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
  ...
  $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
  
  $1 = 20439
  
  $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
  ...
  $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
  
  $1 = 20541
  
It stabilizes at 20541

To be honest, if we add HASH_FIXED_SIZE to SharedBufHash in ShmemInitHash
then it works, but with a noticeable performance regression.

Moreover, I didn't see "out of shared memory" even when starting with 23
spare items instead of 128 (NUM_BUFFER_PARTITIONS).
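
For reference, the HASH_FIXED_SIZE experiment is just one extra flag in
InitBufTable (buf_table.c); roughly, from memory rather than a verbatim
diff:

    /* InitBufTable(), sketch of the experiment */
    HASHCTL     info;

    info.keysize = sizeof(BufferTag);
    info.entrysize = sizeof(BufferLookupEnt);
    info.num_partitions = NUM_BUFFER_PARTITIONS;

    SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
                                  size, size,
                                  &info,
                                  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
                                  HASH_FIXED_SIZE);   /* <-- the added flag */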


Sentence two:
=============

> With the patch, the number of concurrent buffer tag
> changes is no longer limited by NUM_BUFFER_PARTITIONS, because you
> release the lock on the old buffer partition before acquiring the lock
> on the new partition, and therefore there can be any number of
> backends trying to change buffer tags at the same time.

Let's check.
I took the v9 branch:
- no "thundering herd" prevention yet
- "get_hash_entry" is not modified
- SharedBufHash is HASH_FIXED_SIZE      (!!!)
- no spare items at all, just NBuffers. (!!!)

https://www.postgresql.org/message-id/6e6cfb8eea5ccac8e4bc2249fe0614d9f97055ee.camel%40postgrespro.ru

I noticed some "out of shared memory" errors under high connection counts
(> 350) with this version.  But I claimed it was because of race
conditions in "get_hash_entry": concurrent backends may take free
entries from one freelist and put them back into another.
Example:
- backend A checks freeList[30] - it is empty
- backend B takes an entry from freeList[31]
- backend C puts an entry into freeList[30]
- backend A checks freeList[31] - it is empty
- backend A fails with "out of shared memory"

Let's check my claim: set NUM_FREELISTS to 1, so there is no
possible race condition in "get_hash_entry".
....
Not a single "out of shared memory" for 800 clients over 30 seconds.

(Well, in fact on this single-socket 80 HT-core EPYC I didn't get
"out of shared memory" even with NUM_FREELISTS 32.  I noticed them
on a 2-socket 56 HT-core Xeon Gold.)

At the same time, the master branch needs at least 15 spare items
with NUM_FREELISTS 1 to work without "out of shared memory" for
800 clients over 30 seconds.

Therefore the suggested approach reduces the real need for hash entries
(when there are no races in "get_hash_entry").

If one looks into the code, one can see there is no need for spare items
in the suggested code (see the sketch below):
- when a backend calls BufTableInsert it already has a victim buffer.
  The victim buffer either:
  - was uninitialized
  -- therefore wasn't in the hash table
  --- therefore there is a free entry for it in the freelist
  - or was just cleaned
  -- then the freed entry is stored in DynaHashReuse
  --- so no free entry from the freelist is needed.
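
At the dynahash level the reuse looks roughly like this (a sketch only;
the real code goes through BufTableDelete/BufTableInsert, and the exact
shape of the HASH_REUSE call in the patch may differ):

    /* remove the old mapping and stash its element for reuse
     * instead of pushing it onto a freelist */
    hash_search_with_hash_value(SharedBufHash, &oldTag, oldHash,
                                HASH_REUSE, NULL);
    LWLockRelease(oldPartitionLock);

    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    /* HASH_ENTER picks up the stashed element, so no freelist access */
    ent = hash_search_with_hash_value(SharedBufHash, &newTag, newHash,
                                      HASH_ENTER, &found);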

And, not surprisingly, there is no huge regression from setting
NUM_FREELISTS to 1, because with reuse we rarely touch the freelist at all.


Sentence three:
===============

(not an exact citation)
- The sequence is not atomic now, therefore it is fragile.

Well, going from "theoretical concerns" to practice, here is the new part
of the control flow (a sketch in code follows the list):
- we clear the buffer (but keep it pinned)
- delete the buffer from the hash table if it was there, and store the
  entry for reuse
- release the old partition lock
- acquire the new partition lock
- try to insert into the new partition
- on conflict
-- return the hash entry to some freelist
-- pin the found buffer
-- unpin the victim buffer
-- return the victim to the buffer manager's freelist
- without conflict
-- reuse the saved entry if there was one
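
In (simplified) code the same flow is roughly as below.  This is a
sketch, not the patch text; error paths and usage-count handling are
omitted:

    /* victim is already pinned; its tag was cleared under oldPartitionLock */
    BufTableDelete(&oldTag, oldHash);
    LWLockRelease(oldPartitionLock);        /* released before taking the next one */

    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    existing_id = BufTableInsert(&newTag, newHash, victim->buf_id);
    if (existing_id >= 0)
    {
        /* somebody else inserted the same page first */
        BufferDesc *found = GetBufferDescriptor(existing_id);

        /* pin "found" here, while still holding newPartitionLock */
        LWLockRelease(newPartitionLock);
        UnpinBuffer(victim, true);
        StrategyFreeBuffer(victim);         /* victim goes back to the freelist */
        return found;
    }
    /* no conflict: set the victim's tag, release the lock, return victim */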

For a problem to arise, one of these actions would have to fail without
taking down the whole cluster, i.e. it would have to elog(ERROR) or
elog(FATAL).  In any other case the whole cluster stops anyway.

Could BufTableDelete elog(ERROR|FATAL)?
No.
(There is one elog(ERROR), but with the comment "shouldn't happen".
It could really be changed to PANIC.)

Could LWLockRelease elog(ERROR|FATAL)?
No.
(elog(ERROR, "lock is not held") cannot be triggered since we
certainly hold the lock.)

Could LWLockAcquire elog(ERROR|FATAL)?
Well, there is `elog(ERROR, "too many LWLocks taken");`
It is not possible because we just did LWLockRelease.

Could BufTableInsert elog(ERROR|FATAL)?
There is "out of shared memory", which is avoidable with the get_hash_entry
modifications or with HASH_FIXED_SIZE plus some spare items.

Could CHECK_FOR_INTERRUPTS raise something?
No: there is a single line between LWLockRelease and LWLockAcquire, and
it doesn't contain CHECK_FOR_INTERRUPTS.

Therefore there is a single fixable case, "out of shared memory" (fixable
by HASH_FIXED_SIZE or by improvements to "get_hash_entry").


Maybe I'm not quite right on some point.  I'd be glad to learn.

---------

regards

Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

От
Robert Haas
Дата:
On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>   $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
>   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
>
>   $1 = 16512
>
>   $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
>   ...
>   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
>
>   $1 = 20439
>
>   $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
>   ...
>   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
>
>   $1 = 20541
>
> It stabilizes at 20541

Hmm. So is the existing comment incorrect? Remember, I was complaining
about this change:

--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
  *
  * Since we can't tolerate running out of lookup table entries, we must be
  * sure to specify an adequate table size here.  The maximum steady-state
- * usage is of course NBuffers entries, but BufferAlloc() tries to insert
- * a new entry before deleting the old.  In principle this could be
- * happening in each partition concurrently, so we could need as many as
- * NBuffers + NUM_BUFFER_PARTITIONS entries.
+ * usage is of course NBuffers entries. But due to concurrent
+ * access to numerous free lists in dynahash we can miss free entry that
+ * moved between free lists. So it is better to have some spare free entries
+ * to reduce probability of entry allocations after server start.
  */
  InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

Pre-patch, the comment claims that the maximum number of buffer
entries that can be simultaneously used is limited to NBuffers +
NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
size. The idea is that we normally need no more than 1 entry per buffer,
but sometimes we might have 2 entries for the same buffer if we're in
the process of changing the buffer tag, because we make the new entry
before removing the old one. To change the buffer tag, we need the
buffer mapping lock for the old partition and the new one, but if both
are the same, we need only one buffer mapping lock. That means that in
the worst case, you could have a number of processes equal to
NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
between values that both fall into the same partition, and thus each
using 2 entries. Then you could have every other buffer in use and
thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
entries. Now I think you're saying we go far beyond that number, and
what I wonder is how that's possible. If the system doesn't work the
way the comment says it does, maybe we ought to start by talking about
what to do about that.
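
To put numbers on that (back-of-the-envelope, using the defaults):

  NBuffers (shared_buffers = 128MB, 8kB blocks)  = 16384
  NUM_BUFFER_PARTITIONS                          = 128
  expected cap = 16384 + 128                     = 16512  (the initial gdb value)
  observed after the 600-connection runs         = 20541  (~4000 past the cap)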

I am a bit confused by your description of having done "p
SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
HTAB and HTAB's hctl member is of type HASHHDR, which has no field
called "allocated". I thought maybe my analysis here was somehow
mistaken, so I tried the debugger, which took the same view of it that
I did:

(lldb) p SharedBufHash->hctl->allocated.value
error: <user expression 0>:1:22: no member named 'allocated' in 'HASHHDR'
SharedBufHash->hctl->allocated.value
~~~~~~~~~~~~~~~~~~~  ^

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
On Thu, 21/04/2022 at 16:24 -0400, Robert Haas wrote:
> On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> >   $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 16512
> > 
> >   $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> >   ...
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 20439
> > 
> >   $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> >   ...
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 20541
> > 
> > It stabilizes at 20541
> 
> Hmm. So is the existing comment incorrect?

It is correct and incorrect at the same time.  Logically it is correct.
And it is correct in practice if HASH_FIXED_SIZE is set for SharedBufHash
(which it currently is not).  But setting HASH_FIXED_SIZE hurts performance
when there is a low number of spare items.

> Remember, I was complaining
> about this change:
> 
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
> @@ -481,10 +481,10 @@ StrategyInitialize(bool init)
>   *
>   * Since we can't tolerate running out of lookup table entries, we must be
>   * sure to specify an adequate table size here.  The maximum steady-state
> - * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> - * a new entry before deleting the old.  In principle this could be
> - * happening in each partition concurrently, so we could need as many as
> - * NBuffers + NUM_BUFFER_PARTITIONS entries.
> + * usage is of course NBuffers entries. But due to concurrent
> + * access to numerous free lists in dynahash we can miss free entry that
> + * moved between free lists. So it is better to have some spare free entries
> + * to reduce probability of entry allocations after server start.
>   */
>   InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> 
> Pre-patch, the comment claims that the maximum number of buffer
> entries that can be simultaneously used is limited to NBuffers +
> NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
> size. The idea is that we normally need more than 1 entry per buffer,
> but sometimes we might have 2 entries for the same buffer if we're in
> the process of changing the buffer tag, because we make the new entry
> before removing the old one. To change the buffer tag, we need the
> buffer mapping lock for the old partition and the new one, but if both
> are the same, we need only one buffer mapping lock. That means that in
> the worst case, you could have a number of processes equal to
> NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
> between values that both fall into the same partition, and thus each
> using 2 entries. Then you could have every other buffer in use and
> thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
> entries. Now I think you're saying we go far beyond that number, and
> what I wonder is how that's possible. If the system doesn't work the
> way the comment says it does, maybe we ought to start by talking about
> what to do about that.

In the current master state:
- SharedBufHash is not declared as HASH_FIXED_SIZE
- get_hash_entry falls back to element_alloc too eagerly (as soon as it
  doesn't find a free entry in the current freelist partition).
- get_hash_entry has races.
- if there is a small number of spare items (and NUM_BUFFER_PARTITIONS is
  a small number) and HASH_FIXED_SIZE is set, it becomes contended and
  therefore slow.

HASH_REUSE solves (for shared buffers) most of these issues.  The free list
becomes a rare fallback, so HASH_FIXED_SIZE for SharedBufHash doesn't lead
to a performance hit.  And with a fair number of spare items, get_hash_entry
will find a free entry despite its races.

> I am a bit confused by your description of having done "p
> SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
> HTAB and HTAB's hctl member is of type HASHHDR, which has no field
> called "allocated".

The previous letter contains links to the small patches I used for the
experiments.  The link that adds "allocated" is https://pastebin.com/c5z0d5mz

>  I thought maybe my analysis here was somehow
> mistaken, so I tried the debugger, which took the same view of it that
> I did:
> 
> (lldb) p SharedBufHash->hctl->allocated.value
> error: <user expression 0>:1:22: no member named 'allocated' in 'HASHHDR'
> SharedBufHash->hctl->allocated.value
> ~~~~~~~~~~~~~~~~~~~  ^


-----

regards

Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
Btw, I've run tests on an EPYC (80 cores).

1 key per select
  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      29053 |      28959 |      26715 |      25631 
      2 |      53714 |      53002 |      55211 |      53699 
      3 |      69796 |      72100 |      72355 |      71164 
      5 |     118045 |     112066 |     122182 |     119825 
      7 |     151933 |     156298 |     162001 |     160834 
     17 |     344594 |     347809 |     390103 |     386676 
     27 |     497656 |     527313 |     587806 |     598450 
     53 |     732524 |     853831 |     906569 |     947050 
     83 |     823203 |     991415 |    1056884 |    1222530 
    107 |     812730 |     930175 |    1004765 |    1232307 
    139 |     781757 |     938718 |     995326 |    1196653 
    163 |     758991 |     969781 |     990644 |    1143724 
    191 |     774137 |     977633 |     996763 |    1210899 
    211 |     771856 |     973361 |    1024798 |    1187824 
    239 |     756925 |     940808 |     954326 |    1165303 
    271 |     756220 |     940508 |     970254 |    1198773 
    307 |     746784 |     941038 |     940369 |    1159446 
    353 |     710578 |     928296 |     923437 |    1189575 
    397 |     715352 |     915931 |     911638 |    1180688 

3 keys per select

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      17448 |      17104 |      18359 |      19077 
      2 |      30888 |      31650 |      35074 |      35861 
      3 |      44653 |      43371 |      47814 |      47360 
      5 |      69632 |      64454 |      76695 |      76208 
      7 |      96385 |      92526 |     107587 |     107930 
     17 |     195157 |     205156 |     253440 |     239740 
     27 |     302343 |     316768 |     386748 |     335148 
     53 |     334321 |     396359 |     402506 |     486341 
     83 |     300439 |     374483 |     408694 |     452731 
    107 |     302768 |     369207 |     390599 |     453817 
    139 |     294783 |     364885 |     379332 |     459884 
    163 |     272646 |     344643 |     376629 |     460839 
    191 |     282307 |     334016 |     363322 |     449928 
    211 |     275123 |     321337 |     371023 |     445246 
    239 |     263072 |     341064 |     356720 |     441250 
    271 |     271506 |     333066 |     373994 |     436481 
    307 |     261545 |     333489 |     348569 |     466673 
    353 |     255700 |     331344 |     333792 |     455430 
    397 |     247745 |     325712 |     326680 |     439245 



Вложения

Re: BufferAlloc: don't take two simultaneous locks

От
Robert Haas
Дата:
On Thu, Apr 21, 2022 at 6:58 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> At the master state:
> - SharedBufHash is not declared as HASH_FIXED_SIZE
> - get_hash_entry falls back to element_alloc too fast (just if it doesn't
>   found free entry in current freelist partition).
> - get_hash_entry has races.
> - if there are small number of spare items (and NUM_BUFFER_PARTITIONS is
>   small number) and HASH_FIXED_SIZE is set, it becomes contended and
>   therefore slow.
>
> HASH_REUSE solves (for shared buffers) most of this issues. Free list
> became rare fallback, so HASH_FIXED_SIZE for SharedBufHash doesn't lead
> to performance hit. And with fair number of spare items, get_hash_entry
> will find free entry despite its races.

Hmm, I see. The idea of trying to arrange to reuse entries rather than
pushing them onto a freelist and immediately trying to take them off
again is an interesting one, and I kind of like it. But I can't
imagine that anyone would commit this patch the way you have it. It's
way too much action at a distance. If any ereport(ERROR,...) could
happen between the HASH_REUSE operation and the subsequent HASH_ENTER,
it would be disastrous, and those things are separated by multiple
levels of call stack across different modules, so mistakes would be
easy to make. If this could be made into something dynahash takes care
of internally without requiring extensive cooperation with the calling
code, I think it would very possibly be accepted.

One approach would be to have a hash_replace() call that takes two
const void * arguments, one to delete and one to insert. Then maybe
you propagate that idea upward and have, similarly, a BufTableReplace
operation that uses that, and then the bufmgr code calls
BufTableReplace instead of BufTableDelete. Maybe there are other
better ideas out there...
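
For concreteness, the interface I'm imagining would have roughly this
shape (just a sketch of possible signatures, nothing worked out):

    /* dynahash: delete oldKeyPtr and insert newKeyPtr in one call,
     * reusing the deleted element so the freelist is not touched;
     * returns the new entry, sets *foundPtr if newKeyPtr already existed */
    extern void *hash_replace(HTAB *hashp,
                              const void *oldKeyPtr,
                              const void *newKeyPtr,
                              bool *foundPtr);

    /* buf_table.c wrapper: move the mapping from oldTag to newTag;
     * returns the existing buffer id on conflict, or -1 on success */
    extern int BufTableReplace(BufferTag *oldTag, uint32 oldHash,
                               BufferTag *newTag, uint32 newHash,
                               int buf_id);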

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
On Fri, 06/05/2022 at 10:26 -0400, Robert Haas wrote:
> On Thu, Apr 21, 2022 at 6:58 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> > At the master state:
> > - SharedBufHash is not declared as HASH_FIXED_SIZE
> > - get_hash_entry falls back to element_alloc too fast (just if it doesn't
> >   found free entry in current freelist partition).
> > - get_hash_entry has races.
> > - if there are small number of spare items (and NUM_BUFFER_PARTITIONS is
> >   small number) and HASH_FIXED_SIZE is set, it becomes contended and
> >   therefore slow.
> > 
> > HASH_REUSE solves (for shared buffers) most of this issues. Free list
> > became rare fallback, so HASH_FIXED_SIZE for SharedBufHash doesn't lead
> > to performance hit. And with fair number of spare items, get_hash_entry
> > will find free entry despite its races.
> 
> Hmm, I see. The idea of trying to arrange to reuse entries rather than
> pushing them onto a freelist and immediately trying to take them off
> again is an interesting one, and I kind of like it. But I can't
> imagine that anyone would commit this patch the way you have it. It's
> way too much action at a distance. If any ereport(ERROR,...) could
> happen between the HASH_REUSE operation and the subsequent HASH_ENTER,
> it would be disastrous, and those things are separated by multiple
> levels of call stack across different modules, so mistakes would be
> easy to make. If this could be made into something dynahash takes care
> of internally without requiring extensive cooperation with the calling
> code, I think it would very possibly be accepted.
> 
> One approach would be to have a hash_replace() call that takes two
> const void * arguments, one to delete and one to insert. Then maybe
> you propagate that idea upward and have, similarly, a BufTableReplace
> operation that uses that, and then the bufmgr code calls
> BufTableReplace instead of BufTableDelete. Maybe there are other
> better ideas out there...

No.

While HASH_REUSE is a good addition to the patch's overall performance
improvement, it is not required for the major gain.

The major gain comes from not taking two partition locks simultaneously.

hash_replace would require two locks, so it is not an option.

regards

-----

Yura




Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
Good day, hackers.

This is a continuation of the BufferAlloc saga.

This time I've tried to implement the following approach:
- if there's no buffer, insert a placeholder
- then find a victim
- if another backend wants to insert the same buffer, it waits on a
  ConditionVariable.

The patch makes a separate ConditionVariable per backend, and the
placeholder contains the backend id.  So waiters don't suffer from
collisions on the partition; they wait for exactly the concrete buffer
they need.

This patch doesn't contain any dynahash changes, since the order of
operations doesn't change: "insert then delete".  So there is no way to
"reserve" an entry.

But it contains changes to ConditionVariable:

- adds ConditionVariableSleepOnce, which doesn't reinsert the process back
  onto the CV's proclist.
  Unlike ConditionVariableSleep, this method cannot be used in a loop,
  and ConditionVariablePrepareToSleep must be called before it.

- adds ConditionVariableBroadcastFast - an improvement over the regular
  ConditionVariableBroadcast that wakes processes in batches.
  So CVBroadcastFast doesn't acquire/release the CV's spinlock mutex for
  every proclist entry, but rather once per batch of entries.

  I believe it could safely replace ConditionVariableBroadcast, though
  I haven't yet tried the replacement and checked.

Tests:
- tests done on a 2-socket Xeon 5220 2.20GHz with turbo boost disabled
  (i.e. max frequency is 2.20GHz)
- runs on 1 socket or 2 sockets using numactl
- pgbench scale 100 - 1.5GB of data
- shared_buffers: 128MB, 1GB (and 2GB)
- variations of simple_select with 1 key per query, 3 keys per query
  and 10 keys per query.

1 socket 1 key

  conns |  master 128M |     v12 128M |    master 1G |       v12 1G 
--------+--------------+--------------+--------------+--------------
      1 |        25670 |        24926 |        29491 |        28858 
      2 |        50157 |        48894 |        58356 |        57180 
      3 |        75036 |        72904 |        87152 |        84869 
      5 |       124479 |       120720 |       143550 |       140799 
      7 |       168586 |       164277 |       199360 |       195578 
     17 |       319943 |       314010 |       364963 |       358550 
     27 |       423617 |       420528 |       491493 |       485139 
     53 |       491357 |       490994 |       574477 |       571753 
     83 |       487029 |       486750 |       571057 |       566335 
    107 |       478429 |       479862 |       565471 |       560115 
    139 |       467953 |       469981 |       556035 |       551056 
    163 |       459467 |       463272 |       548976 |       543660 
    191 |       448420 |       456105 |       540881 |       534556 
    211 |       440229 |       458712 |       545195 |       535333 
    239 |       431754 |       471373 |       547111 |       552591 
    271 |       421767 |       473479 |       544014 |       557910 
    307 |       408234 |       474285 |       539653 |       556629 
    353 |       389360 |       472491 |       534719 |       554696 
    397 |       377063 |       471513 |       527887 |       554383 

1 socket 3 keys

  conns |  master 128M |     v12 128M |    master 1G |       v12 1G 
--------+--------------+--------------+--------------+--------------
      1 |        15277 |        14917 |        20109 |        19564 
      2 |        29587 |        28892 |        39430 |        36986 
      3 |        44204 |        43198 |        58993 |        57196 
      5 |        71471 |        68703 |        96923 |        92497 
      7 |        98823 |        97823 |       133173 |       130134 
     17 |       201351 |       198865 |       258139 |       254702 
     27 |       254959 |       255503 |       338117 |       339044 
     53 |       277048 |       291923 |       384300 |       390812 
     83 |       251486 |       287247 |       376170 |       385302 
    107 |       232037 |       281922 |       365585 |       380532 
    139 |       210478 |       276544 |       352430 |       373815 
    163 |       193875 |       271842 |       341636 |       368034 
    191 |       179544 |       267033 |       334408 |       362985 
    211 |       172837 |       269329 |       330287 |       366478 
    239 |       162647 |       272046 |       322646 |       371807 
    271 |       153626 |       271423 |       314017 |       371062 
    307 |       144122 |       270540 |       305358 |       370462 
    353 |       129544 |       268239 |       292867 |       368162 
    397 |       123430 |       267112 |       284394 |       366845 
    
1 socket 10 keys

  conns |  master 128M |     v12 128M |    master 1G |       v12 1G 
--------+--------------+--------------+--------------+--------------
      1 |         6824 |         6735 |        10475 |        10220 
      2 |        13037 |        12628 |        20382 |        19849 
      3 |        19416 |        19043 |        30369 |        29554 
      5 |        31756 |        30657 |        49402 |        48614 
      7 |        42794 |        42179 |        67526 |        65071 
     17 |        91443 |        89772 |       139630 |       139929 
     27 |       107751 |       110689 |       165996 |       169955 
     53 |        97128 |       120621 |       157670 |       184382 
     83 |        82344 |       117814 |       142380 |       183863 
    107 |        70764 |       115841 |       134266 |       182426 
    139 |        57561 |       112528 |       125090 |       180121 
    163 |        50490 |       110443 |       119932 |       178453 
    191 |        45143 |       108583 |       114690 |       175899 
    211 |        42375 |       107604 |       111444 |       174109 
    239 |        39861 |       106702 |       106253 |       172410 
    271 |        37398 |       105819 |       102260 |       170792 
    307 |        35279 |       105355 |        97164 |       168313 
    353 |        33427 |       103537 |        91629 |       166232 
    397 |        31778 |       101793 |        87230 |       164381 
    
2 sockets 1 key

  conns |  master 128M |     v12 128M |    master 1G |       v12 1G 
--------+--------------+--------------+--------------+--------------
      1 |        24839 |        24386 |        29246 |        28361 
      2 |        46655 |        45265 |        55942 |        54327 
      3 |        69278 |        68332 |        83984 |        81608 
      5 |       115263 |       112746 |       139012 |       135426 
      7 |       159881 |       155119 |       193846 |       188399 
     17 |       373808 |       365085 |       456463 |       441603 
     27 |       503663 |       495443 |       600335 |       584741 
     53 |       708849 |       744274 |       900923 |       908488 
     83 |       593053 |       862003 |       985953 |      1038033 
    107 |       431806 |       875704 |       957115 |      1075172 
    139 |       328380 |       879890 |       881652 |      1069872 
    163 |       288339 |       874792 |       824619 |      1064047 
    191 |       255666 |       870532 |       790583 |      1061124 
    211 |       241230 |       865975 |       764898 |      1058473 
    239 |       227344 |       857825 |       732353 |      1049745 
    271 |       216095 |       848240 |       703729 |      1043182 
    307 |       206978 |       833980 |       674711 |      1031533 
    353 |       198426 |       803830 |       633783 |      1018479 
    397 |       191617 |       744466 |       599170 |      1006134 
    
2 sockets 3 keys

  conns |  master 128M |     v12 128M |    master 1G |       v12 1G 
--------+--------------+--------------+--------------+--------------
      1 |        14688 |        14088 |        18912 |        18905 
      2 |        26759 |        25925 |        36817 |        35924 
      3 |        40002 |        38658 |        54765 |        53266 
      5 |        63479 |        63041 |        90521 |        87496 
      7 |        88561 |        87101 |       123425 |       121877 
     17 |       199411 |       196932 |       289555 |       282146 
     27 |       270121 |       275950 |       386884 |       383019
     53 |       202918 |       374848 |       395967 |       501648 
     83 |       149599 |       363623 |       335815 |       478628 
    107 |       126501 |       348125 |       311617 |       472473 
    139 |       106091 |       331350 |       279843 |       466408 
    163 |        95497 |       321978 |       260884 |       461688 
    191 |        87427 |       312815 |       241189 |       458252 
    211 |        82783 |       307261 |       231435 |       454327 
    239 |        78930 |       299661 |       219655 |       451826 
    271 |        74081 |       294233 |       211555 |       448412 
    307 |        71352 |       288133 |       202838 |       446143 
    353 |        67872 |       279948 |       193354 |       441929 
    397 |        66178 |       275784 |       185556 |       438330 

2 sockets 10 keys

  conns |  master 128M |     v12 128M |    master 1G |       v12 1G 
--------+--------------+--------------+--------------+--------------
      1 |         6200 |         6108 |        10163 |         9563 
      2 |        11196 |        10871 |        18373 |        17827 
      3 |        16479 |        16129 |        26807 |        26584 
      5 |        26750 |        26241 |        44291 |        43409 
      7 |        36501 |        35433 |        60508 |        59379 
     17 |        77320 |        77451 |       130413 |       128452 
     27 |        91833 |       105643 |       147259 |       156833 
     53 |        57138 |       115793 |       119306 |       150647 
     83 |        44435 |       108850 |       105454 |       148006 
    107 |        38031 |       105199 |        95108 |       146162 
    139 |        31697 |       101096 |        84011 |       143281 
    163 |        28826 |        98255 |        78411 |       141375 
    191 |        26223 |        96224 |        74256 |       139646 
    211 |        24933 |        94815 |        71542 |       137834 
    239 |        23626 |        92849 |        69289 |       137235 
    271 |        22664 |        90938 |        66431 |       136080 
    307 |        21691 |        89358 |        64661 |       133166 
    353 |        20712 |        88239 |        61619 |       133339 
    397 |        20374 |        86708 |        58937 |       130684 

Well, as you can see, there is some regression at low connection counts.
I don't understand where it comes from.

Moreover, it is present even in the case of 2GB shared buffers - when all
data fits into the buffer cache and the new code doesn't kick in at all.
(Apart from this incomprehensible regression there's no difference in
performance with 2GB shared buffers.)

For example 2GB shared buffers 1 socket 3 keys:
  conns |    master 2G |       v12 2G 
--------+--------------+--------------
      1 |        23491 |        22621 
      2 |        46436 |        44851 
      3 |        69265 |        66844 
      5 |       112432 |       108801 
      7 |       158859 |       150247 
     17 |       297600 |       291605 
     27 |       390041 |       384590 
     53 |       448384 |       447588 
     83 |       445582 |       442048 
    107 |       440544 |       438200 
    139 |       433893 |       430818 
    163 |       427436 |       424182 
    191 |       420854 |       417045 
    211 |       417228 |       413456 

Perhaps something changes in the memory layout due to the array of CVs, or
the compiler lays out/optimizes functions differently.  I can't find the
reason ;-(  I would appreciate help with this.


regards

---

Yura Sokolov

Вложения

Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:

> Tests:
> - tests done on 2 socket Xeon 5220 2.20GHz with turbo bust disabled
>   (ie max frequency is 2.20GHz)

Forgot to mention:
- this time it was CentOS 7.9.2009 (Core) with Linux mn10 3.10.0-1160.el7.x86_64

Perhaps the older kernel explains master's poor performance on 2 sockets
compared to my previous results (when this server ran Linux 5.10.103-1 Debian).

Or there is a degradation in PostgreSQL's master branch in between.
I'll try to check today.

regards

---

Yura Sokolov




Re: BufferAlloc: don't take two simultaneous locks

От
Yura Sokolov
Дата:
On Tue, 28/06/2022 at 14:26 +0300, Yura Sokolov wrote:
> On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:
> 
> > Tests:
> > - tests done on 2 socket Xeon 5220 2.20GHz with turbo bust disabled
> >   (ie max frequency is 2.20GHz)
> 
> Forgot to mention:
> - this time it was Centos7.9.2009 (Core) with Linux mn10 3.10.0-1160.el7.x86_64
> 
> Perhaps older kernel describes poor master's performance on 2 sockets
> compared to my previous results (when this server had Linux 5.10.103-1 Debian).
> 
> Or there is degradation in PostgreSQL's master branch between.
> I'll try to check today.

No, an old master commit (7e12256b47, Sat Mar 12 14:21:40 2022) behaves the
same.  So it is clearly an old-kernel issue.  Perhaps futexes were much
slower in those days.




Re: BufferAlloc: don't take two simultaneous locks

От
Ibrar Ahmed
Дата:


On Tue, Jun 28, 2022 at 4:50 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
On Tue, 28/06/2022 at 14:26 +0300, Yura Sokolov wrote:
> On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:
>
> > Tests:
> > - tests done on 2 socket Xeon 5220 2.20GHz with turbo bust disabled
> >   (ie max frequency is 2.20GHz)
>
> Forgot to mention:
> - this time it was Centos7.9.2009 (Core) with Linux mn10 3.10.0-1160.el7.x86_64
>
> Perhaps older kernel describes poor master's performance on 2 sockets
> compared to my previous results (when this server had Linux 5.10.103-1 Debian).
>
> Or there is degradation in PostgreSQL's master branch between.
> I'll try to check today.

No, an old master commit (7e12256b47, Sat Mar 12 14:21:40 2022) behaves the
same.  So it is clearly an old-kernel issue.  Perhaps futexes were much
slower in those days.



The patch requires a rebase; please do that.

Hunk #1 FAILED at 231.
Hunk #2 succeeded at 409 (offset 82 lines).

1 out of 2 hunks FAILED -- saving rejects to file src/include/storage/buf_internals.h.rej


--
Ibrar Ahmed

Re: BufferAlloc: don't take two simultaneous locks

От
Michael Paquier
Дата:
On Wed, Sep 07, 2022 at 12:53:07PM +0500, Ibrar Ahmed wrote:
> Hunk #1 FAILED at 231.
> Hunk #2 succeeded at 409 (offset 82 lines).
>
> 1 out of 2 hunks FAILED -- saving rejects to file
> src/include/storage/buf_internals.h.rej

With no rebase done since this notice, I have marked this entry as
RwF.
--
Michael

Вложения