Thread: [Patch v2] Make block and file size for WAL and relations defined at cluster creation
[Patch v2] Make block and file size for WAL and relations defined at cluster creation
From
Remi Colinet
Date:
Hello,
This is version 2 of the patch to make the file and block sizes for WAL and relations configurable at run time, at initdb. Recently, the definition of the WAL file size was moved from server build time to cluster creation time. The current patch goes further in this direction with the relation block and file sizes and with the WAL block size. More could be done, for instance with LOBLKSIZE (TBD).
Names and values
- the WAL block size
- the WAL file size
- the relation block size
- the relation file size
I noticed that these parameters are referred to inconsistently throughout the source code, in both name and unit.
Such names are:
BLCKSZ: the relation block size in bytes
RELSEG_SIZE: maximum number of blocks allowed in one disk file
XLOG_BLCKSZ: the WAL block size in bytes
XLOG_SEG_SIZE: the WAL file size in bytes
blcksz (in control file): the relation block size in bytes (same as BLCKSZ)
relseg_size (in control file): the relation file size in blocks (same as RELSEG_SIZE)
xlog_blcksz (in control file): WAL block size in bytes (same as XLOG_BLCKSZ)
xlog_seg_size (in control file): the WAL file size in bytes (same as XLOG_SEG_SIZE)
WalSegSz (in pg_resetwal.c): the WAL segment size in bytes
wal_segment_size (in xlog.c): the WAL segment size in bytes
segment_size (in guc.c): the relation segment size
For the current patch, I have defined common names to be used throughout the source code, both in the server and in the various utilities, with units in:
- bytes for the block sizes
- blocks for the file sizes
These are:
- wal_blck_size: which replaces XLOG_BLCKSZ
- wal_file_blck
- wal_file_size: which is wal_blck_size * wal_file_blck. It replaces XLOG_SEG_SIZE and wal_segment_size
- rel_blck_size: which replaces BLCKSZ
- rel_file_blck: it replaces RELSEG_SIZE and segment_size
- rel_file_size: which is rel_blck_size * rel_file_blck.
Lower-case names are used as a reminder that these values are no longer statically defined at compile time.
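As an illustration of how the names above relate, here is a minimal sketch in C. It is not code from the patch: file_size_bytes() and file_blck_for() are hypothetical helpers, and the only thing taken from the text is that a file size in bytes is the block size in bytes multiplied by the file length in blocks. The operands are widened to 64 bits before multiplying, since a product such as 8192 * 1048576 does not fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: derive a file size in
 * bytes from a block size in bytes and a file length in blocks, e.g.
 * rel_file_size = rel_blck_size * rel_file_blck.
 */
static uint64_t
file_size_bytes(uint32_t blck_size, uint32_t file_blck)
{
    /* Widen before multiplying; a 32-bit product would wrap at 4GB. */
    return (uint64_t) blck_size * (uint64_t) file_blck;
}

/*
 * Hypothetical helper: number of blocks per file needed to reach a
 * target file size with a given block size.
 */
static uint32_t
file_blck_for(uint64_t file_size, uint32_t blck_size)
{
    /* The target size must be a whole number of blocks. */
    assert(blck_size > 0 && file_size % blck_size == 0);
    return (uint32_t) (file_size / blck_size);
}
```

With these helpers, file_size_bytes(8192, 32768) gives the 256MB WAL file size, and file_blck_for() gives the number of blocks needed to keep a relation file at 1GB for a chosen block size.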
Patch
The patch consists mostly of small changes, except for a few files that require more work with palloc/pfree.
The most affected files are:
src/backend/access/heap/pruneheap.c
src/backend/access/nbtree/nbtree.c
src/backend/access/nbtree/nbtsearch.c
src/backend/access/transam/generic_xlog.c
src/backend/access/transam/xlog.c
src/backend/nodes/tidbitmap.c
src/bin/initdb/initdb.c
tidbitmap.c is the most affected file because it includes the simplehash.h header.
But even for this file, the change turns out to be straightforward.
The other affected files have tiny changes, or changes that pose no difficulty.
The patch is built on top of commit 0772c152b9bd02baeca6920c3371fce95e8f13dc (Mon Nov 27 20:56:46 2017 -0500).
I will rebase on the latest version once I have completed all the initial tests with the different possible combinations of block and file sizes, both for the relations and the WAL files.
Justifications are:
- We may test different combinations of file and block sizes, for the relations and the WAL, in order to get the best performance from the server. Avoiding a compilation for each combination of values seems to make sense. This is what I did to test the patch: I created about 20 different combinations of values for the file and block sizes of the relation and WAL files.
- The same binary can be used on the same host with several database instances/clusters, each using different values for block and file sizes.
- Linux distributions deliver PostgreSQL as a binary already compiled with the default values. This means DBAs need to rebuild the binary for each combination of block and file sizes, whether for the WAL or the relations.
- Selecting the correct values for file and block sizes is a DBA task, not a developer task. For instance, someone who wants to create a Linux filesystem with a given block size is not forced to accept a value chosen by the developer of the filesystem driver when the driver was compiled.
- The file and block sizes should depend mostly on the physical server and physical storage, not on the database software itself.
The cost of using run-time configurable values for the file and block sizes of the WAL and relations is low, both:
- from a developer point of view: the source code changes are spread across many files, but only a few have significant changes. Mainly tidbitmap.c is affected; the other changes are minor.
- from a run-time point of view: the overhead occurs only at the start of the database instance, and even there it is very low, with only a few more dynamic memory allocations.
Test cases
The following combinations of values have been tested so far, by creating a cluster and filling a table with 10 to 200 million rows.
The WAL file and WAL block sizes have not been changed so far.
rel_blck_size
--rel_blck_size=1024 --rel_file_blck=1048576 --wal_blck_size=8192 --wal_file_blck=32768 ok (1GB relation files / 256MB WAL files)
--rel_blck_size=2048 --rel_file_blck=524288 --wal_blck_size=8192 --wal_file_blck=32768 ok (1GB file)
--rel_blck_size=4096 --rel_file_blck=262144 --wal_blck_size=8192 --wal_file_blck=32768 ok (1GB file)
--rel_blck_size=8192 --rel_file_blck=131072 --wal_blck_size=8192 --wal_file_blck=32768 ok (1GB file)
--rel_blck_size=16384 --rel_file_blck=65536 --wal_blck_size=8192 --wal_file_blck=32768 ok (1GB file)
--rel_blck_size=32768 --rel_file_blck=32768 --wal_blck_size=8192 --wal_file_blck=32768 ok (1GB file)
rel_file_blck
--rel_blck_size=8192 --rel_file_blck=262144 --wal_blck_size=8192 --wal_file_blck=32768 ok (2GB files)
--rel_blck_size=8192 --rel_file_blck=524288 --wal_blck_size=8192 --wal_file_blck=32768 ok (4GB files)
--rel_blck_size=8192 --rel_file_blck=1048576 --wal_blck_size=8192 --wal_file_blck=32768 ok (8GB files)
Further tests are going to be done with different combinations of block and file sizes for the WAL.
To do
Convert large object compile time block and file size to run-time parameters
Further tests with different combinations of block and file sizes.
Rebase on the latest commit of the master PostgreSQL tree.
Remove debugging code introduced by the current patch
Split current patch into smaller chunks.
v2
Fixed a bug in simplehash.h caused by &data[i] dereferences, where data is an array of PagetableEntry.
Removed debugging code from tidbitmap.c and simplehash.h.
Fixed the REL_FILE_SIZE macro overflow warning caused by missing types in the pg_control_def.h macros.
Tests were run with the cases above.
[root@rco v2]# diffstat blkfilesizes_v2.patch
TODO | 57 ++
configure.in | 94 ---
contrib/amcheck/verify_nbtree.c | 4
contrib/bloom/blinsert.c | 14
contrib/bloom/bloom.h | 26 -
contrib/bloom/blutils.c | 6
contrib/bloom/blvacuum.c | 6
contrib/file_fdw/file_fdw.c | 6
contrib/pageinspect/brinfuncs.c | 8
contrib/pageinspect/btreefuncs.c | 6
contrib/pageinspect/rawpage.c | 12
contrib/pg_prewarm/pg_prewarm.c | 4
contrib/pg_standby/pg_standby.c | 7
contrib/pgstattuple/pgstatapprox.c | 6
contrib/pgstattuple/pgstatindex.c | 4
contrib/pgstattuple/pgstattuple.c | 10
contrib/postgres_fdw/deparse.c | 2
contrib/postgres_fdw/postgres_fdw.c | 2
param.sh | 1
src/backend/access/brin/brin_pageops.c | 4
src/backend/access/common/bufmask.c | 4
src/backend/access/common/reloptions.c | 8
src/backend/access/gin/ginbtree.c | 12
src/backend/access/gin/gindatapage.c | 18
src/backend/access/gin/ginentrypage.c | 2
src/backend/access/gin/ginfast.c | 6
src/backend/access/gin/ginget.c | 6
src/backend/access/gin/ginvacuum.c | 2
src/backend/access/gin/ginxlog.c | 4
src/backend/access/gist/gistbuild.c | 8
src/backend/access/gist/gistbuildbuffers.c | 10
src/backend/access/gist/gistscan.c | 1
src/backend/access/hash/hash.c | 7
src/backend/access/hash/hashpage.c | 4
src/backend/access/heap/README.HOT | 2
src/backend/access/heap/heapam.c | 17
src/backend/access/heap/pruneheap.c | 39 +
src/backend/access/heap/rewriteheap.c | 4
src/backend/access/heap/syncscan.c | 2
src/backend/access/heap/visibilitymap.c | 8
src/backend/access/nbtree/nbtpage.c | 2
src/backend/access/nbtree/nbtree.c | 18
src/backend/access/nbtree/nbtsearch.c | 5
src/backend/access/nbtree/nbtsort.c | 10
src/backend/access/spgist/spgdoinsert.c | 4
src/backend/access/spgist/spginsert.c | 2
src/backend/access/spgist/spgscan.c | 1
src/backend/access/spgist/spgtextproc.c | 10
src/backend/access/spgist/spgutils.c | 4
src/backend/access/transam/README | 2
src/backend/access/transam/clog.c | 10
src/backend/access/transam/commit_ts.c | 4
src/backend/access/transam/generic_xlog.c | 44 +
src/backend/access/transam/multixact.c | 12
src/backend/access/transam/slru.c | 22
src/backend/access/transam/subtrans.c | 5
src/backend/access/transam/timeline.c | 2
src/backend/access/transam/twophase.c | 2
src/backend/access/transam/xlog.c | 603 ++++++++++++++----------
src/backend/access/transam/xlogarchive.c | 12
src/backend/access/transam/xlogfuncs.c | 10
src/backend/access/transam/xloginsert.c | 48 +
src/backend/access/transam/xlogreader.c | 141 +++--
src/backend/access/transam/xlogutils.c | 34 -
src/backend/bootstrap/bootstrap.c | 33 -
src/backend/commands/async.c | 15
src/backend/commands/tablecmds.c | 2
src/backend/commands/vacuumlazy.c | 4
src/backend/executor/execGrouping.c | 1
src/backend/nodes/tidbitmap.c | 135 ++++-
src/backend/optimizer/path/costsize.c | 10
src/backend/optimizer/util/plancat.c | 2
src/backend/postmaster/checkpointer.c | 4
src/backend/replication/basebackup.c | 30 -
src/backend/replication/logical/logical.c | 2
src/backend/replication/logical/reorderbuffer.c | 18
src/backend/replication/slot.c | 2
src/backend/replication/walreceiver.c | 14
src/backend/replication/walreceiverfuncs.c | 4
src/backend/replication/walsender.c | 30 -
src/backend/storage/buffer/buf_init.c | 4
src/backend/storage/buffer/bufmgr.c | 8
src/backend/storage/buffer/freelist.c | 6
src/backend/storage/buffer/localbuf.c | 6
src/backend/storage/file/buffile.c | 20
src/backend/storage/file/copydir.c | 2
src/backend/storage/freespace/README | 8
src/backend/storage/freespace/freespace.c | 36 -
src/backend/storage/freespace/indexfsm.c | 7
src/backend/storage/lmgr/predicate.c | 2
src/backend/storage/page/bufpage.c | 27 -
src/backend/storage/smgr/md.c | 104 ++--
src/backend/tcop/postgres.c | 2
src/backend/utils/adt/selfuncs.c | 2
src/backend/utils/init/globals.c | 20
src/backend/utils/init/miscinit.c | 6
src/backend/utils/init/postinit.c | 23
src/backend/utils/misc/guc.c | 175 ++++--
src/backend/utils/misc/pg_controldata.c | 4
src/backend/utils/sort/logtape.c | 49 -
src/backend/utils/sort/tuplesort.c | 6
src/bin/initdb/initdb.c | 305 +++++++++---
src/bin/pg_basebackup/pg_basebackup.c | 18
src/bin/pg_basebackup/pg_receivewal.c | 26 -
src/bin/pg_basebackup/pg_recvlogical.c | 11
src/bin/pg_basebackup/receivelog.c | 28 -
src/bin/pg_basebackup/streamutil.c | 76 +--
src/bin/pg_basebackup/streamutil.h | 6
src/bin/pg_basebackup/walmethods.c | 14
src/bin/pg_controldata/pg_controldata.c | 16
src/bin/pg_resetwal/pg_resetwal.c | 125 +++-
src/bin/pg_rewind/copy_fetch.c | 9
src/bin/pg_rewind/filemap.c | 11
src/bin/pg_rewind/libpq_fetch.c | 7
src/bin/pg_rewind/parsexlog.c | 26 -
src/bin/pg_rewind/pg_rewind.c | 33 -
src/bin/pg_test_fsync/pg_test_fsync.c | 71 +-
src/bin/pg_upgrade/controldata.c | 7
src/bin/pg_upgrade/file.c | 15
src/bin/pg_upgrade/pg_upgrade.c | 3
src/bin/pg_waldump/pg_waldump.c | 69 +-
src/common/controldata_utils.c | 98 +++
src/include/access/brin_page.h | 2
src/include/access/ginblock.h | 6
src/include/access/gist_private.h | 20
src/include/access/hash.h | 5
src/include/access/htup_details.h | 11
src/include/access/itup.h | 2
src/include/access/nbtree.h | 10
src/include/access/relscan.h | 7
src/include/access/slru.h | 2
src/include/access/spgist_private.h | 22
src/include/access/tuptoaster.h | 2
src/include/access/xlog_internal.h | 8
src/include/access/xlogreader.h | 9
src/include/access/xlogrecord.h | 6
src/include/common/controldata_utils.h | 4
src/include/lib/simplehash.h | 65 +-
src/include/nodes/execnodes.h | 1
src/include/nodes/nodes.h | 1
src/include/pg_config.h.in | 31 -
src/include/pg_config_manual.h | 8
src/include/pg_control_def.h | 44 +
src/include/storage/bufmgr.h | 4
src/include/storage/bufpage.h | 5
src/include/storage/checksum_impl.h | 2
src/include/storage/fsm_internals.h | 5
src/include/storage/large_object.h | 4
src/include/storage/md.h | 12
src/include/storage/off.h | 2
src/include/utils/rel.h | 4
src/interfaces/libpq/libpq-int.h | 5
152 files changed, 2257 insertions(+), 1299 deletions(-)
[root@rco v2]#
Re: [Patch v2] Make block and file size for WAL and relations defined at cluster creation
From
Alvaro Herrera
Date:
Remi Colinet wrote:
> Hello,
>
> This is version 2 of the patch to make the file and block sizes for WAL and
> relations, run-time configurable at initdb.
I don't think this works, since we have a rule that pallocs are prohibited within critical section and I see that your patch changes some stack-allocated variables to palloc'ed. For example I think the heap_page_prune changes should break some test or other.
This patch is too massive to review.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Patch v2] Make block and file size for WAL and relations defined at cluster creation
From
Remi Colinet
Date:
2018-01-03 23:11 GMT+01:00 Alvaro Herrera <alvherre@alvh.no-ip.org>:
> Remi Colinet wrote:
> > Hello,
> >
> > This is version 2 of the patch to make the file and block sizes for WAL and
> > relations, run-time configurable at initdb.
> I don't think this works, since we have a rule that pallocs are
> prohibited within critical section and I see that your patch changes
> some stack-allocated variables to palloc'ed. For example I think the
> heap_page_prune changes should break some test or other.
Thank you for the heads-up.
For the heap_page_prune() function, the critical section starts after the palloc() call and ends before the pfree().
Unless critical sections can be nested, we are outside any such section.
For the other palloc()/pfree() uses, a palloc()/pfree() call was already there, and the changes consist of:
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc(rel_blck_size);
Only one change could be suspect: the one in async.c.
But that change is also made outside of a critical section.
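The ordering described above for heap_page_prune() (allocate, enter the critical section, leave it, free) can be sketched in isolation. This is a mock, not server code: crit_section_count, start_crit_section(), end_crit_section(), mock_palloc() and prune_page_sketch() are hypothetical stand-ins, modeling the critical section as a nesting counter is an assumption of the sketch, and the mock allocator simply returns NULL inside a critical section where an assert-enabled server build would be expected to complain.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Mock bookkeeping: how many critical sections are currently open. */
static int crit_section_count = 0;

static void
start_crit_section(void)
{
    crit_section_count++;
}

static void
end_crit_section(void)
{
    assert(crit_section_count > 0);
    crit_section_count--;
}

/* Mock allocator: refuses to allocate while a critical section is open. */
static void *
mock_palloc(size_t size)
{
    if (crit_section_count > 0)
        return NULL;
    return malloc(size);
}

/* Run-time block size; 8192 is only an example value. */
static size_t rel_blck_size = 8192;

/* Returns 0 on success, -1 if the scratch page could not be allocated. */
static int
prune_page_sketch(void)
{
    /* Allocate the scratch page before entering the critical section. */
    char *page = mock_palloc(rel_blck_size);

    if (page == NULL)
        return -1;

    start_crit_section();
    memset(page, 0, rel_blck_size);     /* work on the page, no allocation */
    end_crit_section();

    /* Release the page only after the critical section has ended. */
    free(page);
    return 0;
}
```

Called from outside any critical section, prune_page_sketch() succeeds. Called while the caller already holds one open, the allocation is refused, which is the nested case the question above is about.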
> This patch is too massive to review.
I understand the point.
If the patch is clean enough and does not show any regression, I will split it into smaller parts.
Regards
Remi
> --
> Álvaro Herrera https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services