Thread: Block / Page Size Optimization


Block / Page Size Optimization

From
Gunther
Date:
Hi all, I am sure this should be a FAQ, but I can't see a definitive 
answer, only chatter on various lists and forums.

Default page size of PostgreSQL is 8192 bytes.

Default IO block size in Linux is 4096 bytes.

I can set an XFS file system with 8192 bytes block size, but then it 
does not mount on Linux, because the VM page size is the limit, 4096 again.

There seems to be no way to change that in (most, common) Linux 
variants. In FreeBSD there appears to be a way to change that.

But then, there is a hardware limit also, as far as the VM memory page 
allocation is concerned. Apparently on most i386 / amd64 architectures the 
VM page sizes are 4k, 2M, and 1G. The latter, I believe, are called 
"hugepages" and I only ever see that discussed in the PostgreSQL manuals 
for Linux, not for FreeBSD.

People have asked: does it matter? And then there is all that chatter 
about "why don't you run a benchmark and report back to us" -- "OK, will 
do" -- and then it's crickets.

But why is this such a secret?

On Amazon AWS there is the following very simple situation: IO is capped 
on IO operations per second (IOPS). Let's say, on a smallish volume, I 
get 300 IOPS (once my burst balance is used up.)

Now my simple theoretical reasoning is this: one IO call transfers 1 
block of 4k size. That means, with a cap of 300 IOPS, I get to send 1.17 
MB per second. That would be the absolute limit. BUT, if I could double 
the transfer size to 8k, I should be able to move 2.34 MB per second. 
Shouldn't I?
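
To make my naive arithmetic explicit, here is a back-of-the-envelope
sketch in Python (mine; it assumes exactly one block is transferred per
I/O request, which is exactly the assumption I am unsure about):

# Naive model: every I/O request moves exactly one filesystem block.
IOPS_CAP = 300

def max_throughput_mb_s(iops, block_size_bytes):
    return iops * block_size_bytes / (1024 * 1024)

print(max_throughput_mb_s(IOPS_CAP, 4096))   # ~1.17 MB/s with 4 kB blocks
print(max_throughput_mb_s(IOPS_CAP, 8192))   # ~2.34 MB/s with 8 kB blocks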

That might well depend on whether AWS' virtual device paths would 
support these 8k block sizes.

But something tells me that my reasoning here is totally off, because I 
get better IO throughput than that. Even at 3000 IOPS I would only get 
11 MB per second, and I am sure I am getting more like 50-100 MB/s, no? 
So my simplistic logic is false.

What really is the theoretical issue with the file system block size? 
Where, in theory, does the benefit of using an 8 kB XFS block size come 
from, or even of increasing the PostgreSQL page size to 16 kB along 
with a 16 kB XFS block size? I remember having seen standard UFS block 
sizes of 16 kB. But then why is Linux so adamant about refusing to 
mount an 8 kB XFS just because its VM page size is only 4 kB?

Doesn't this all have one straight explanation?

If you have a link that I can just read, I appreciate you sharing that. 
I think that should be on some Wiki or FAQ somewhere. If I get a quick 
and dirty explanation with some pointers, I can try to write it out into 
a more complete answer that might be added into some documentation or 
FAQ somewhere.

thanks & regards,
-Gunther




Re: Block / Page Size Optimization

From
Andres Freund
Date:
Hi,

On 2019-04-08 11:09:07 -0400, Gunther wrote:
> I can set an XFS file system with 8192 bytes block size, but then it does
> not mount on Linux, because the VM page size is the limit, 4096 again.
> 
> There seems to be no way to change that in (most, common) Linux variants. In
> FreeBSD there appears to be a way to change that.
> 
> But then, there is a hardware limit also, as far as the VM memory page
> allocation is concerned. Apparently on most i386 / amd64 architectures the VM
> page sizes are 4k, 2M, and 1G. The latter, I believe, are called "hugepages"
> and I only ever see that discussed in the PostgreSQL manuals for Linux, not
> for FreeBSD.
> 
> People have asked: does it matter? And then there is all that chatter about
> "why don't you run a benchmark and report back to us" -- "OK, will do" --
> and then it's crickets.
> 
> But why is this such a secret?
> 
> On Amazon AWS there is the following very simple situation: IO is capped on
> IO operations per second (IOPS). Let's say, on a smallish volume, I get 300
> IOPS (once my burst balance is used up.)
> 
> Now my simple theoretical reasoning is this: one IO call transfers 1 block
> of 4k size. That means, with a cap of 300 IOPS, I get to send 1.17 MB per
> second. That would be the absolute limit. BUT, if I could double the
> transfer size to 8k, I should be able to move 2.34 MB per second. Shouldn't
> I?

The kernel collapses consecutive write requests. You can see the
average sizes of IO requests using iostat -xm 1. When e.g. bulk loading
into postgres I see:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
sda              4.00  696.00      0.02    471.05     0.00    80.00   0.00  10.31    8.50    7.13   4.64     4.00   693.03  0.98  68.50

so the average write request size was 693.03 kB. Thus I got ~470 MB/s
despite there being only ~700 IOPS. That's with 4 kB memory pages, 4 kB FS
blocks, and an 8 kB postgres block size.
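
For reference, the arithmetic behind those numbers as a quick Python
sketch (throughput is simply requests per second times average request
size, which is why merged requests break the one-block-per-IOPS
assumption):

def throughput_mb_s(requests_per_sec, avg_request_kb):
    # MB/s = (requests/s) * (average request size in kB) / 1024
    return requests_per_sec * avg_request_kb / 1024.0

print(throughput_mb_s(696.00, 693.03))   # ~471 MB/s from ~700 write IOPS
print(throughput_mb_s(696.00, 8.0))      # ~5.4 MB/s if each request were one 8 kB page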


There still might be some benefit of different FS block sizes, but it's
not going to be related directly to IOPS.

Greetings,

Andres Freund



Re: Block / Page Size Optimization

From
Tomas Vondra
Date:
On Mon, Apr 08, 2019 at 11:09:07AM -0400, Gunther wrote:
>Hi all, I am sure this should be a FAQ, but I can't see a definitive 
>answer, only chatter on various lists and forums.
>
>Default page size of PostgreSQL is 8192 bytes.
>
>Default IO block size in Linux is 4096 bytes.
>
>I can set an XFS file system with 8192 bytes block size, but then it 
>does not mount on Linux, because the VM page size is the limit, 4096 
>again.
>
>There seems to be no way to change that in (most, common) Linux 
>variants. In FreeBSD there appears to be a way to change that.
>
>But then, there is a hardware limit also, as far as the VM memory page 
>allocation is concerned. Apparently on most i386 / amd64 architectures 
>the VM page sizes are 4k, 2M, and 1G. The latter, I believe, are 
>called "hugepages" and I only ever see that discussed in the 
>PostgreSQL manuals for Linux, not for FreeBSD.
>

You're mixing page sizes at three different levels

1) memory (usually 4kB on x86, although we now have hugepages too)

2) filesystem (block size generally can't exceed the memory page size, at
least for native filesystems; 4kB by default for most filesystems on x86)

3) database (8kB by default)

Then there's also the "hardware page" (sectors) which used to be 512B,
then it got increased to 4kB, and then SSDs entirely changed how all
that works and it's quite specific to individual devices / models.

Of course, the exact behavior depends on sizes used at each level, and
it may interfere in unexpected ways.
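
A quick way to check the first two sizes on a running Linux system, as a
rough Python sketch (the database-level size is reported inside
PostgreSQL by SHOW block_size;):

import os

# Level 1: VM page size of the running system; typically 4096 on x86.
print("VM page size:", os.sysconf("SC_PAGE_SIZE"))

# Level 2: preferred I/O block size of the filesystem under a given path;
# typically 4096 for ext4/XFS on x86.
print("FS block size:", os.statvfs("/").f_bsize)

# Level 3: run "SHOW block_size;" in psql (8192 by default).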

>People have asked: does it matter? And then there is all that chatter 
>about "why don't you run a benchmark and report back to us" -- "OK, 
>will do" -- and then it's crickets.
>
>But why is this such a secret?
>

What is a secret? That I/O request size affects performance? That's a
pretty obvious fact, I think. Years ago I did exactly that kind of
benchmark, and the results are just as expected - smaller pages are
better for random I/O, larger pages are better for sequential access.
Essentially, it's a throughput vs. latency kind of trade-off.

The other thing of course is that page size affects how adaptive the
cache can be - even if you keep the cache size the same, doubling the
page size means you only have half the "slots" you used to have. So
you're more likely to evict stuff that you'll need soon, negatively
affecting the cache hit ratio.

OTOH if you decrease the page size, you increase the "overhead" fraction
(because each page has a fixed-size header). So while you get more
slots, a bigger fraction will be used for this metadata.
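
To put rough numbers on both effects, a back-of-the-envelope Python
sketch (I'm assuming the fixed 24-byte PostgreSQL page header and an
arbitrary 4 GB cache size):

# Effect of page size on cache slot count and fixed-header overhead.
CACHE_BYTES = 4 * 1024**3       # example: 4 GB of cache
PAGE_HEADER_BYTES = 24          # fixed per-page header

for page_size in (4096, 8192, 16384):
    slots = CACHE_BYTES // page_size
    overhead_pct = 100.0 * PAGE_HEADER_BYTES / page_size
    print(page_size, "B pages:", slots, "slots,",
          round(overhead_pct, 2), "% of the cache spent on page headers")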

In practice, it probably does not matter much whether you have 4kB, 8kB
or 16kB pages. It will make a difference for some workloads, especially
if you align the sizes to e.g. match SSD page sizes etc.

But frankly, there are probably better/cheaper ways to achieve the same
benefits. And it's usually the case that systems are a mix of workloads
and what improves one is bad for another one.

>On Amazon AWS there is the following very simple situation: IO is 
>capped on IO operations per second (IOPS). Let's say, on a smallish 
>volume, I get 300 IOPS (once my burst balance is used up.)
>
>Now my simple theoretical reasoning is this: one IO call transfers 1 
>block of 4k size. That means, with a cap of 300 IOPS, I get to send 
>1.17 MB per second. That would be the absolute limit. BUT, if I could 
>double the transfer size to 8k, I should be able to move 2.34 MB per 
>second. Shouldn't I?
>

Ummm, I'm no expert on Amazon, but AFAIK the I/O limits are specified
assuming requests of a specific size (16kB IIRC). So doubling the I/O
request size may not actually help much; the throughput limit will
remain the same.

>That might well depend on whether AWS' virtual device paths would 
>support these 8k block sizes.
>
>But something tells me that my reasoning here is totally off, because 
>I get better IO throughput than that. Even at 3000 IOPS I would only 
>get 11 MB per second, and I am sure I am getting more like 50-100 MB/s, 
>no? So my simplistic logic is false.
>

There's a difference between guaranteed and actual throughput. If you
run the workload long enough, chances are the numbers will go down.

>What really is the theoretical issue with the file system block size? 
>Where, in theory, does the benefit of using an 8 kB XFS block size 
>come from, or even of increasing the PostgreSQL page size to 16 kB 
>along with a 16 kB XFS block size? I remember having seen standard 
>UFS block sizes of 16 kB. But then why is Linux so adamant about 
>refusing to mount an 8 kB XFS just because its VM page size is only 4 kB?
>
>Doesn't this all have one straight explanation?
>

Not really. AFAICS the limitation is due to a mix of reasons, and is
mostly a trade-off between code complexity and potential benefits. It's
probably simpler to manage filesystems with blocks no larger than a memory
page, and removing the limitation did not seem very useful compared to
the added complexity. But it's probably a question for kernel hackers.

>If you have a link that I can just read, I appreciate you sharing 
>that. I think that should be on some Wiki or FAQ somewhere. If I get a 
>quick and dirty explanation with some pointers, I can try to write it 
>out into a more complete answer that might be added into some 
>documentation or FAQ somewhere.
>

Maybe read this famous paper by Jim Gray & Franco Putzolu. It's not
exactly about the thing you're asking about, but it's related. It
essentially deals with sizing memory vs. disk I/O, and page size plays
an important role in that too.

[1] https://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Block / Page Size Optimization

From
Bruce Momjian
Date:
On Mon, Apr 15, 2019 at 06:19:06PM +0200, Tomas Vondra wrote:
> On Mon, Apr 08, 2019 at 11:09:07AM -0400, Gunther wrote:
> > What really is the theoretical issue with the file system block size?
> > Where, in theory, does the benefit of using an 8 kB XFS block size come
> > from, or even of increasing the PostgreSQL page size to 16 kB along with
> > a 16 kB XFS block size? I remember having seen standard UFS block sizes
> > of 16 kB. But then why is Linux so adamant about refusing to mount an
> > 8 kB XFS just because its VM page size is only 4 kB?
> > 
> > Doesn't this all have one straight explanation?
> > 
> 
> Not really. AFAICS the limitation is due to a mix of reasons, and is
> mostly a trade-off between code complexity and potential benefits. It's
> probably simpler to manage filesystems with blocks no larger than a memory
> page, and removing the limitation did not seem very useful compared to
> the added complexity. But it's probably a question for kernel hackers.

My guess is that having the file system block size be the same as the
virtual memory page size makes flipping pages from kernel to userspace
memory much simpler.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +