Lowering the default wal_blocksize to 4K

From: Andres Freund
Subject: Lowering the default wal_blocksize to 4K
Date:
Msg-id: 20231009230805.funj5ipoggjyzjz6@awork3.anarazel.de
Responses: Re: Lowering the default wal_blocksize to 4K  (Bruce Momjian <bruce@momjian.us>)
Re: Lowering the default wal_blocksize to 4K  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Lowering the default wal_blocksize to 4K  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List: pgsql-hackers
Hi,

I've mentioned this to a few people before, but forgot to start an actual
thread. So here we go:

I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
the current 8192.  The reason is that

a) We don't gain much from a block size above 4096, as we already write all
   the pending WAL data in one go (except when at the tail of wal_buffers).
   We *do* incur more overhead for page headers, but compared to the actual
   WAL data it is not a lot (~0.29% of space goes to page headers with 8192
   vs ~0.59% with 4096).

b) Writing 8KB when we have to flush a partially filled buffer can
   substantially increase write amplification. In a transactional workload,
   this will often double the write volume.
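Both points can be illustrated with a bit of arithmetic (a minimal sketch; the 24-byte figure is the size of a short WAL page header, and the 300-byte record size is an assumption for a small transaction):

```python
# (a) Page header overhead: every WAL page starts with a header
# (24 bytes for a short page header), so overhead is header/pagesize.
HDR = 24
for blksz in (8192, 4096):
    print(f"{blksz}: header overhead = {HDR / blksz:.2%}")
# 8192: header overhead = 0.29%
# 4096: header overhead = 0.59%

# (b) Flushing a partially filled page still writes the whole page,
# so a lone small commit pays for a full page write.
record = 300  # assumed WAL bytes for one small transaction
for blksz in (8192, 4096):
    print(f"{blksz}: worst-case write amplification ~{blksz / record:.0f}x")
```

The halved page size roughly doubles the relative header overhead, but also halves the worst-case amount of padding written per flush.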

Currently disks mostly have a 4096-byte "sector size". Sometimes that's
exposed directly; sometimes they also accept 512-byte writes, but internally
those require a read-modify-write operation.
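On Linux the two sizes can be checked via sysfs; a small sketch (the device path is an assumption, and this is Linux-specific):

```python
# Read a block device's logical vs physical sector size from sysfs.
# A drive reporting logical=512 but physical=4096 emulates 512-byte
# writes via an internal read-modify-write cycle.
from pathlib import Path

def sector_sizes(queue_dir: str) -> tuple[int, int]:
    """Return (logical, physical) block sizes from a device's queue dir."""
    q = Path(queue_dir)
    logical = int((q / "logical_block_size").read_text())
    physical = int((q / "physical_block_size").read_text())
    return logical, physical

# e.g. sector_sizes("/sys/block/nvme0n1/queue")  # hypothetical device name
```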


For some example numbers, I ran a very simple insert workload with a varying
number of clients with both a wal_blocksize=4096 and wal_blocksize=8192
cluster, and measured the amount of bytes written before/after.  The table was
recreated before each run, followed by a checkpoint and the benchmark. Here I
ran the inserts only for 15s each, because the results don't change
meaningfully with longer runs.
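The before/after measurement can be approximated from /proc/diskstats (a sketch; the sample device name is an assumption — note that diskstats sector counts are always in 512-byte units, regardless of the device's real sector size):

```python
# Sketch: measure bytes written to a device across a benchmark run
# by diffing the sectors-written counter in /proc/diskstats.
def sectors_written(diskstats_text: str, device: str) -> int:
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major minor name reads ... ; index 9 = sectors written
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9])
    raise KeyError(device)

def bytes_written(before: str, after: str, device: str) -> int:
    # /proc/diskstats sectors are defined as 512 bytes
    return (sectors_written(after, device)
            - sectors_written(before, device)) * 512
```

Snapshot /proc/diskstats before and after the run and diff the two.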


With XLOG_BLCKSZ=8192

clients         tps    disk bytes written
1         667         81296
2         739         89796
4        1446         89208
8        2858         90858
16        5775         96928
32       11920        115351
64       23686        135244
128       46001        173390
256       88833        239720
512      146208        335669


With XLOG_BLCKSZ=4096

clients         tps    disk bytes written
1         751         46838
2         773         47936
4        1512         48317
8        3143         52584
16        6221         59097
32       12863         73776
64       25652         98792
128       48274        133330
256       88969        200720
512      146298        298523


This is on a not-that-fast NVMe SSD (Samsung SSD 970 PRO 1TB).


It's IMO quite interesting that even at the higher client counts, the number
of bytes written doesn't reach parity.


On a stripe of two very fast SSDs:

With XLOG_BLCKSZ=8192

clients         tps    disk bytes written
1       23786        2893392
2       38515        4683336
4       63436        4688052
8      106618        4618760
16      177905        4384360
32      254890        3890664
64      297113        3031568
128      299878        2297808
256      308774        1935064
512      292515        1630408


With XLOG_BLCKSZ=4096

clients         tps    disk bytes written
1       25742        1586748
2       43578        2686708
4       62734        2613856
8      116217        2809560
16      200802        2947580
32      269268        2461364
64      323195        2042196
128      317160        1550364
256      309601        1285744
512      292063        1103816

It's fun to see how the total amount of bytes written *decreases* at higher
concurrency, because it becomes more likely that pages are filled completely.
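One way to see this: each flush writes whole pages, and with more concurrent backends more commit records share one flush, so the padding of the tail page is amortized over more transactions. A toy model (the 300-byte record size and the one-flush-per-group assumption are simplifications):

```python
# Toy model of flush-driven write amplification under group commit.
# Each flush writes whole pages; the tail page is written in full
# even when only partially filled.
def bytes_per_txn(clients: int, record: int = 300, blksz: int = 8192) -> float:
    payload = clients * record        # WAL bytes covered by one flush
    pages = -(-payload // blksz)      # ceil division: whole pages written
    return pages * blksz / clients    # physical bytes per transaction

for blksz in (8192, 4096):
    print(blksz, [round(bytes_per_txn(c, blksz=blksz))
                  for c in (1, 8, 64, 512)])
```

In this model a single client pays a full page per commit (so 4K pages halve its cost), while at high client counts the per-transaction cost converges toward the pure record size for both page sizes.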


One thing I noticed is that our auto-configuration of wal_buffers leads to
different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
great.
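That's because the auto-tuned value is computed in WAL *pages*, not bytes; a sketch modeled on XLOGChooseNumBuffers() in xlog.c (NBuffers is the shared_buffers count in 8kB buffers; the 128MB example is an assumption):

```python
# Sketch of wal_buffers auto-tuning, modeled on XLOGChooseNumBuffers():
# pages = NBuffers / 32, capped at one WAL segment, floored at 8 pages.
# Since the result is a page count, the byte size depends on XLOG_BLCKSZ.
def auto_wal_buffers_bytes(nbuffers: int, xlog_blcksz: int,
                           wal_segment_size: int = 16 * 1024 * 1024) -> int:
    xbuffers = nbuffers // 32
    xbuffers = min(xbuffers, wal_segment_size // xlog_blcksz)
    xbuffers = max(xbuffers, 8)
    return xbuffers * xlog_blcksz

# shared_buffers = 128MB -> NBuffers = 16384
for blksz in (8192, 4096):
    print(blksz, auto_wal_buffers_bytes(16384, blksz))
# 8192 4194304   (4MB)
# 4096 2097152   (2MB)
```

So halving XLOG_BLCKSZ silently halves the auto-configured wal_buffers size in bytes.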


Performing a COPY workload (1024 files, split across N clients) with both
settings shows no performance difference, but a very slight increase in
total bytes written (about 0.25%, which is roughly what I'd expect).


Personally I'd say the slight increase in WAL volume is more than outweighed
by the increase in throughput and decrease in bytes written.


There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Greetings,

Andres Freund


