Thread: Revisiting disk layout on ZFS systems

Revisiting disk layout on ZFS systems

From: Karl Denninger
Date:
I've been doing a bit of benchmarking and real-world performance
testing, and have found some curious results.

The load in question is a fairly busy machine hosting a web service that
uses PostgreSQL as its back end.

"Conventional Wisdom" is that you want to run an 8k record size to match
Postgresql's inherent write size for the database.
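
For concreteness, that conventional setup looks something like the
following (pool and dataset names here are purely illustrative):

    zfs create -o recordsize=8k tank/pgdata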

However, operational experience says this may no longer be the case now
that modern ZFS systems support LZ4 compression, because modern CPUs can
compress fast enough to outrun raw I/O capacity. This in turn means that
the nominal recordsize no longer matches what actually lands on disk,
and PostgreSQL's on-disk file format is rather compressible -- indeed,
on my dataset it compresses at roughly 1.24x, which is nothing to sneeze
at.
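
For reference, compression is a per-dataset property and ZFS reports the
achieved ratio directly, so this is easy to check (dataset name again
illustrative):

    zfs set compression=lz4 tank/pgdata
    zfs get compressratio tank/pgdata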

The odd thing is that I am getting better performance with a 128k record
size on this application than I get with an 8k one!  Not only is the
system subjectively faster to respond and objectively able to sustain a
higher TPS load, but the I/O busy percentage as measured during
operation is MARKEDLY lower (by nearly an order of magnitude!)
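
(The busy percentage here is the per-device figure reported by the usual
iostat-style tools -- e.g. watching the busy column of something like

    iostat -x 5

while the load runs.)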

This is not expected behavior!

What I am curious about, however, is the xlog -- that appears to suffer
pretty badly from a 128k record size, although it compresses even
better: 1.94x (!)

The files in the xlog directory are large (16MB each) and thus "first
blush" would be that having a larger record size for that storage area
would help.  It appears that instead it hurts.

Ideas?

--
-- Karl
karl@denninger.net




Re: Revisiting disk layout on ZFS systems

From: Heikki Linnakangas
Date:
On 04/28/2014 06:47 PM, Karl Denninger wrote:
> What I am curious about, however, is the xlog -- that appears to suffer
> pretty badly from a 128k record size, although it compresses even
> better: 1.94x (!)
>
> The files in the xlog directory are large (16MB each) and thus "first
> blush" would be that having a larger record size for that storage area
> would help.  It appears that instead it hurts.

The WAL is fsync'd frequently. My guess is that that causes a lot of
extra work to repeatedly recompress the same data, or something like that.

- Heikki


Re: Revisiting disk layout on ZFS systems

From: Karl Denninger
Date:
On 4/28/2014 1:04 PM, Heikki Linnakangas wrote:
> On 04/28/2014 06:47 PM, Karl Denninger wrote:
>> What I am curious about, however, is the xlog -- that appears to suffer
>> pretty badly from a 128k record size, although it compresses even
>> better: 1.94x (!)
>>
>> The files in the xlog directory are large (16MB each) and thus "first
>> blush" would be that having a larger record size for that storage area
>> would help.  It appears that instead it hurts.
>
> The WAL is fsync'd frequently. My guess is that that causes a lot of
> extra work to repeatedly recompress the same data, or something like
> that.
>
> - Heikki
>
It shouldn't, as ZFS rewrites on change; what's showing up is not a high
I/O *count* but rather a high percentage-busy, which implies lots of
head movement (that is, lots of sub-allocation-unit writes.)

Isn't WAL essentially sequential writes during normal operation?

--
-- Karl
karl@denninger.net




Re: Revisiting disk layout on ZFS systems

From: Heikki Linnakangas
Date:
On 04/28/2014 09:07 PM, Karl Denninger wrote:
>> The WAL is fsync'd frequently. My guess is that that causes a lot of
>> extra work to repeatedly recompress the same data, or something like
>> that.
>
> It shouldn't, as ZFS rewrites on change; what's showing up is not a
> high I/O *count* but rather a high percentage-busy, which implies lots
> of head movement (that is, lots of sub-allocation-unit writes.)

That sounds consistent with frequent fsyncs.

> Isn't WAL essentially sequential writes during normal operation?

Yes, it's totally sequential. But it's fsync'd at every commit, which
means a lot of small writes.
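
To put rough numbers on that: with a 128k recordsize, a commit that
appends a couple of kB of WAL and then fsyncs plausibly forces ZFS to
recompress and rewrite the whole 128 kB record at the tail of the
segment file, so the same data gets recompressed dozens of times over as
successive commits fill that record. With an 8k recordsize the rewrite
unit is 16x smaller.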

- Heikki


Re: Revisiting disk layout on ZFS systems

From: Jeff Janes
Date:
On Mon, Apr 28, 2014 at 11:07 AM, Karl Denninger <karl@denninger.net> wrote:

> On 4/28/2014 1:04 PM, Heikki Linnakangas wrote:
>> On 04/28/2014 06:47 PM, Karl Denninger wrote:
>>> What I am curious about, however, is the xlog -- that appears to suffer
>>> pretty badly from a 128k record size, although it compresses even
>>> better: 1.94x (!)
>>>
>>> The files in the xlog directory are large (16MB each) and thus "first
>>> blush" would be that having a larger record size for that storage area
>>> would help.  It appears that instead it hurts.
>>
>> The WAL is fsync'd frequently. My guess is that that causes a lot of
>> extra work to repeatedly recompress the same data, or something like that.
>>
>> - Heikki
>
> It shouldn't, as ZFS rewrites on change; what's showing up is not a
> high I/O *count* but rather a high percentage-busy, which implies lots
> of head movement (that is, lots of sub-allocation-unit writes.)
>
> Isn't WAL essentially sequential writes during normal operation?

Only if you have some sort of non-volatile intermediary, or are willing to risk your data integrity.  Otherwise, the fsync nature trumps the sequential nature.
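
(The usual non-volatile intermediary is a battery-backed controller
cache or, on ZFS, a dedicated low-latency log device, added with
something like

    zpool add tank log ada6

-- pool and device names illustrative.)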

Cheers,

Jeff


Re: Revisiting disk layout on ZFS systems

From: Karl Denninger
Date:
On 4/28/2014 1:22 PM, Heikki Linnakangas wrote:
> On 04/28/2014 09:07 PM, Karl Denninger wrote:
>>> The WAL is fsync'd frequently. My guess is that that causes a lot of
>>> extra work to repeatedly recompress the same data, or something like
>>> that.
>>
>> It shouldn't, as ZFS rewrites on change; what's showing up is not a
>> high I/O *count* but rather a high percentage-busy, which implies lots
>> of head movement (that is, lots of sub-allocation-unit writes.)
>
> That sounds consistent with frequent fsyncs.
>
>> Isn't WAL essentially sequential writes during normal operation?
>
> Yes, it's totally sequential. But it's fsync'd at every commit, which
> means a lot of small writes.
>
> - Heikki

Makes sense; I'll muse on whether there's a way to optimize this
further... I'm not running into performance problems at present but I'd
rather be ahead of it....

--
-- Karl
karl@denninger.net




Re: Revisiting disk layout on ZFS systems

From: Karl Denninger
Date:

On 4/28/2014 1:26 PM, Jeff Janes wrote:
> On Mon, Apr 28, 2014 at 11:07 AM, Karl Denninger <karl@denninger.net>
> wrote:
>> Isn't WAL essentially sequential writes during normal operation?
>
> Only if you have some sort of non-volatile intermediary, or are willing
> to risk your data integrity.  Otherwise, the fsync nature trumps the
> sequential nature.

That would be a "no" on the data integrity :-)

-- 
-- Karl
karl@denninger.net

Re: Revisiting disk layout on ZFS systems

From: Albe Laurenz
Date:
Karl Denninger wrote:
> I've been doing a bit of benchmarking and real-world performance
> testing, and have found some curious results.

[...]

> The odd thing is that I am getting better performance with a 128k record
> size on this application than I get with an 8k one!

[...]

> What I am curious about, however, is the xlog -- that appears to suffer
> pretty badly from a 128k record size, although it compresses even
> better: 1.94x (!)
> 
> The files in the xlog directory are large (16MB each) and thus "first
> blush" would be that having a larger record size for that storage area
> would help.  It appears that instead it hurts.

As has been explained, the access patterns for WAL are quite different.

For your experiment, I'd keep them on different file systems so that
you can tune them independently.
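
For example -- dataset names and values illustrative, the point being
only that the properties can differ:

    zfs create -o recordsize=128k -o compression=lz4 tank/pgdata
    zfs create -o recordsize=8k -o compression=lz4 tank/pgxlog

with the data directory on the first and pg_xlog on (or symlinked to)
the second.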

Yours,
Laurenz Albe

Re: Revisiting disk layout on ZFS systems

From: Karl Denninger
Date:
On 4/29/2014 3:13 AM, Albe Laurenz wrote:
> Karl Denninger wrote:
>> I've been doing a bit of benchmarking and real-world performance
>> testing, and have found some curious results.
> [...]
>
>> The odd thing is that I am getting better performance with a 128k record
>> size on this application than I get with an 8k one!
> [...]
>
>> What I am curious about, however, is the xlog -- that appears to suffer
>> pretty badly from a 128k record size, although it compresses even
>> better: 1.94x (!)
>>
>> The files in the xlog directory are large (16MB each) and thus "first
>> blush" would be that having a larger record size for that storage area
>> would help.  It appears that instead it hurts.
> As has been explained, the access patterns for WAL are quite different.
>
> For your experiment, I'd keep them on different file systems so that
> you can tune them independently.
>
They're on physically-different packs (pools and groups of spindles), as
that has been best practice for performance reasons pretty much always
-- I just thought it was interesting, and worth noting, that the usual
recommendation to run an 8k record size for the data store itself may no
longer be valid.

It certainly isn't with my workload.
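
For concreteness, the separation is at the pool level, along these lines
(device names illustrative):

    zpool create data mirror da0 da1 mirror da2 da3
    zpool create xlog mirror da4 da5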

--
-- Karl
karl@denninger.net




Re: Revisiting disk layout on ZFS systems

From: Josh Berkus
Date:
On 04/28/2014 08:47 AM, Karl Denninger wrote:
> The odd thing is that I am getting better performance with a 128k record
> size on this application than I get with an 8k one!  Not only is the
> system subjectively faster to respond and objectively able to sustain a
> higher TPS load, but the I/O busy percentage as measured during
> operation is MARKEDLY lower (by nearly an order of magnitude!)

Thanks for posting your experience!  I'd love it even more if you could
post some numbers to go with.

Questions:

1) is your database (or the active portion thereof) smaller than RAM?

2) is this a DW workload, where most writes are large writes?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com