Discussion: ZFS prefetch considered evil?


ZFS prefetch considered evil?

From: Yaroslav Tykhiy
Date:
Hi All,

I have a mid-size database (~300G) used as an email store and running
on a FreeBSD + ZFS combo.  Its PG_DATA is on ZFS whilst xlog goes to a
different FFS disk.  ZFS prefetch was enabled by default, and disk time
on PG_DATA was near 100% all the time, with transfer rates heavily
biased to read: ~50-100 MB/s read vs ~2-5 MB/s write.  A former
researcher, I was going to set up disk performance monitoring to
collect some history and see if disabling prefetch would have any
effect, but today I had to find out the difference the hard way.
Sorry, but that's why the numbers I can provide are quite approximate.

Due to a peak in user activity the server just melted down, with mail
data queries taking minutes to execute.  As a last resort, I
rebooted the server with ZFS prefetch disabled -- it couldn't be
disabled at run time in FreeBSD.  Now IMAP feels much more responsive;
transfer rates on PG_DATA are mostly <10 MB/s read and 1-2 MB/s write;
and disk time stays way below 100% unless a bunch of email is being
inserted.
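
(For reference, the knob in question is a boot-time loader tunable;
assuming the stock name in the FreeBSD ZFS port of that era, the
relevant /boot/loader.conf line would be:)

    # /boot/loader.conf -- ZFS prefetch can only be toggled at boot time
    vfs.zfs.prefetch_disable="1"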

My conclusion is that although ZFS prefetch is supposed to be adaptive
and handle random access more or less OK, in reality there is plenty
of room for improvement, so to speak, and for now PostgreSQL
performance can benefit from simply leaving it disabled.  The same may
apply to other database systems as well.

Thanks,
Yar

Re: ZFS prefetch considered evil?

From: Alban Hertroys
Date:
On Jul 8, 2009, at 2:50 AM, Yaroslav Tykhiy wrote:

> Hi All,
>
> I have a mid-size database (~300G) used as an email store and
> running on a FreeBSD + ZFS combo.  Its PG_DATA is on ZFS whilst xlog
> goes to a different FFS disk.  ZFS prefetch was enabled by default,
> and disk time on PG_DATA was near 100% all the time, with transfer
> rates heavily biased to read: ~50-100 MB/s read vs ~2-5 MB/s write.
> A former researcher, I was going to set up disk performance monitoring
> to collect some history and see if disabling prefetch would have any
> effect, but today I had to find out the difference the hard way.
> Sorry, but that's why the numbers I can provide are quite approximate.
>
> Due to a peak in user activity the server just melted down, with
> mail data queries taking minutes to execute.  As a last resort, I
> rebooted the server with ZFS prefetch disabled -- it couldn't be
> disabled at run time in FreeBSD.  Now IMAP feels much more
> responsive; transfer rates on PG_DATA are mostly <10 MB/s read and
> 1-2 MB/s write; and disk time stays way below 100% unless a bunch of
> email is being inserted.
>
> My conclusion is that although ZFS prefetch is supposed to be
> adaptive and handle random access more or less OK, in reality there
> is plenty of room for improvement, so to speak, and for now
> PostgreSQL performance can benefit from simply leaving it disabled.
> The same may apply to other database systems as well.


Are you sure you weren't hitting swap? IIRC prefetch tries to keep
data (disk blocks?) in memory that it fetched recently. ZFS uses quite
a bit of memory, so if you dedicated all your memory to just postgres
and the disk cache, then you didn't leave enough space for the
prefetch data and _something_ will be moved to swap.

If you're running FreeBSD i386 then ZFS requires some careful tuning
due to the limits a 32-bit OS puts on memory. I recall ZFS not being
very stable on i386 a while ago for those reasons; that has by now
been fixed as far as possible, but it's not ideal (and it likely never
will be).

You'll probably want to ask about this on the FreeBSD mailing lists as
well, they'll know much better than I do ;)

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


Re: ZFS prefetch considered evil?

From: Yaroslav Tykhiy
Date:
On 08/07/2009, at 8:39 PM, Alban Hertroys wrote:

> On Jul 8, 2009, at 2:50 AM, Yaroslav Tykhiy wrote:
>
>> Hi All,
>>
>> I have a mid-size database (~300G) used as an email store and
>> running on a FreeBSD + ZFS combo.  Its PG_DATA is on ZFS whilst
>> xlog goes to a different FFS disk.  ZFS prefetch was enabled by
>> default, and disk time on PG_DATA was near 100% all the time, with
>> transfer rates heavily biased to read: ~50-100 MB/s read vs ~2-5
>> MB/s write.  A former researcher, I was going to set up disk
>> performance monitoring to collect some history and see if disabling
>> prefetch would have any effect, but today I had to find out the
>> difference the hard way.  Sorry, but that's why the numbers I can
>> provide are quite approximate.
>>
>> Due to a peak in user activity the server just melted down, with
>> mail data queries taking minutes to execute.  As a last resort, I
>> rebooted the server with ZFS prefetch disabled -- it couldn't be
>> disabled at run time in FreeBSD.  Now IMAP feels much more
>> responsive; transfer rates on PG_DATA are mostly <10 MB/s read and
>> 1-2 MB/s write; and disk time stays way below 100% unless a bunch of
>> email is being inserted.
>>
>> My conclusion is that although ZFS prefetch is supposed to be
>> adaptive and handle random access more or less OK, in reality there
>> is plenty of room for improvement, so to speak, and for now
>> PostgreSQL performance can benefit from simply leaving it disabled.
>> The same may apply to other database systems as well.
>
>
> Are you sure you weren't hitting swap?

A sceptic myself, I genuinely understand your doubt.  But this time I
was sure because I paid attention to the name of the device involved.
Moreover, a thrashing system wouldn't have had such a disparity
between disk read and write rates.
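
(For anyone who wants to rule out the same thing, one way to check on
FreeBSD -- standard base-system tools, nothing exotic -- would be:)

    # watch per-device busy% and transfer rates; the hot device should
    # be the PG_DATA disk, not the swap device
    gstat
    # confirm swap is essentially unused
    swapinfo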

> IIRC prefetch tries to keep data (disk blocks?) in memory that it
> fetched recently.

What you described is just a disk cache.  And a trivial implementation
of prefetch would work as follows:  An application or other file/disk
consumer asks the provider (driver, kernel, whatever) to read, say, 2
disk blocks worth of data.  The provider thinks, "I know you are short-
sighted; I bet you are going to ask for more contiguous blocks very
soon," so it schedules a disk read for many more contiguous blocks
than requested and caches them in RAM.  For bulk data applications
such as file serving this trick works like a charm.  But other
applications do truly random access and never come back for the
prefetched blocks; in this case both disk bandwidth and cache space
are wasted.  An advanced implementation can try to distinguish
sequential and random access patterns, but in reality that appears to
be a challenging task.
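
(A toy sketch of such a trivial heuristic, in C -- purely illustrative,
not ZFS's actual logic: readahead is triggered only when a request
continues exactly where the previous one left off:)

    /* Naive sequential-detection readahead policy.  If the current
     * request starts where the previous one ended, assume a streaming
     * reader and fetch extra blocks; otherwise read only what was
     * asked for.  A random-access workload pays for every speculated
     * block in wasted bandwidth and cache space. */
    #include <stdio.h>

    #define READAHEAD_BLOCKS 32   /* extra blocks to speculate on */

    struct stream {
        long next_expected;       /* next block if access is sequential */
    };

    /* How many blocks to actually read for a request of `nblocks`
     * starting at block `start`. */
    static long
    readahead_decide(struct stream *s, long start, long nblocks)
    {
        long toread = nblocks;

        if (start == s->next_expected)      /* looks sequential */
            toread += READAHEAD_BLOCKS;     /* schedule a bigger read */
        /* else: looks random, read only what was requested */

        s->next_expected = start + nblocks; /* track caller's position */
        return toread;
    }

    int
    main(void)
    {
        struct stream s = { .next_expected = 0 };

        /* sequential reader: every request triggers readahead */
        printf("seq:  %ld blocks\n", readahead_decide(&s, 0, 2));   /* 34 */
        printf("seq:  %ld blocks\n", readahead_decide(&s, 2, 2));   /* 34 */

        /* random reader: the 32 speculated blocks would be wasted */
        printf("rand: %ld blocks\n", readahead_decide(&s, 500, 2)); /* 2 */
        return 0;
    }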

> ZFS uses quite a bit of memory, so if you dedicated all your
> memory to just postgres and the disk cache, then you didn't
> leave enough space for the prefetch data and _something_ will be
> moved to swap.

I hope you know that FreeBSD is exceptionally good at distributing
available memory between its consumers.  That said, useless prefetch
indeed puts extra pressure on disk cache and results in unnecessary
cache evictions, thus making things even worse.  It is true that ZFS
is memory hungry and so rather sensitive to non-optimal memory use
patterns.  Useless prefetch wastes memory that could be used to speed
up other ZFS operations.

> If you're running FreeBSD i386 then ZFS requires some careful tuning
> due to the limits a 32-bit OS puts on memory. I recall ZFS not being
> very stable on i386 a while ago for those reasons; that has by now
> been fixed as far as possible, but it's not ideal (and it likely
> never will be).

I use FreeBSD/amd64 and I'm generally happy with ZFS on that platform.

> You'll probably want to ask about this on the FreeBSD mailing lists
> as well, they'll know much better than I do ;)

Are you a local FreeBSD expert? ;-)  Jokes aside, I don't think this
topic has to do with FreeBSD as such; it is mostly about making the
advanced technologies of PostgreSQL and ZFS go well together.  Even
ZFS developers admit that database applications may call for
exceptions to general ZFS practices and rules.

When I set up my next ZFS-based PostgreSQL server, I think I'll play
with the recordsize property of ZFS and see if setting it to PAGESIZE
makes any difference.
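
(Something along these lines -- the dataset name is hypothetical, and
recordsize only applies to files written after the property is set, so
it has to be done before initdb or a restore:)

    # 8k matches PostgreSQL's default page size (BLCKSZ)
    zfs create -o recordsize=8k tank/pgdata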

Thanks,

Yar

Re: ZFS prefetch considered evil?

From: Scott Marlowe
Date:
On Wed, Jul 8, 2009 at 7:53 PM, Yaroslav Tykhiy <yar@barnet.com.au> wrote:

> Are you a local FreeBSD expert? ;-)  Jokes aside, I don't think this topic
> has to do with FreeBSD as such; it is mostly about making the advanced
> technologies of PostgreSQL and ZFS go well together.  Even ZFS developers
> admit that database applications may call for exceptions to general ZFS
> practices and rules.

That may or may not be true.  What other OSes have ZFS, so that
someone could try to duplicate your results and then test fixes for
them?  I run various Linux flavors, but I would be more than willing
to repurpose a test server for testing pgsql on FreeBSD.  I've got an
8-disk HW RAID machine due in a month that I could test on for a few
days.  It appears the only way to run ZFS under Linux (my primary OS)
is with FUSE, but I'm willing to try it there too.

Re: ZFS prefetch considered evil?

From: Greg Smith
Date:
On Wed, 8 Jul 2009, Yaroslav Tykhiy wrote:

> My conclusion is that although ZFS prefetch is supposed to be adaptive and
> handle random access more or less OK, in reality there is plenty of room for
> improvement, so to speak, and for now PostgreSQL performance can benefit from
> simply leaving it disabled.  The same may apply to other database systems as
> well.

Yup; this is even spelled out at
http://www.cuddletech.com/blog/pivot/entry.php?id=1040

"...the most common complain tends to be by databases which strictly work
in fixed 8K blocks and manage their own caches very effectively. If you
think you have such a case, file-level prefetch can be tuned on the fly
using mdb, I encourage you to play with it and see what is best for your
workload..."
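
(If memory serves, the mdb incantation that guide refers to -- Solaris
only, tunable name as of that era -- is:)

    # disable file-level prefetch on a live Solaris kernel
    # (W0t1 writes decimal 1, i.e. prefetch off; W0t0 turns it back on)
    echo zfs_prefetch_disable/W0t1 | mdb -kw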

Anecdotal reports (which sadly never seem to have repeatable test
cases) abound about prefetch issues:

http://southbrain.com/south/2008/04/the-nightmare-comes-slowly-zfs.html
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2007-06/msg00671.html

Also, there was a pretty serious ZFS problem in this area that got fixed
in the middle of last year on Solaris.  Your FreeBSD install might be
based on a build that is using the older, known bad logic here.  See
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device-Level_Prefetching
and http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6437054 for
details.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: ZFS prefetch considered evil?

From: Alban Hertroys
Date:
On Jul 9, 2009, at 3:53 AM, Yaroslav Tykhiy wrote:

> On 08/07/2009, at 8:39 PM, Alban Hertroys wrote:
>
>> On Jul 8, 2009, at 2:50 AM, Yaroslav Tykhiy wrote:
>> IIRC prefetch tries to keep data (disk blocks?) in memory that it
>> fetched recently.
>
> What you described is just a disk cache.  And a trivial
> implementation of prefetch would work as follows:  An application or
> other file/disk consumer asks the provider (driver, kernel,
> whatever) to read, say, 2 disk blocks worth of data.  The provider
> thinks, "I know you are short-sighted; I bet you are going to ask
> for more contiguous blocks very soon," so it schedules a disk read
> for many more contiguous blocks than requested and caches them in
> RAM.  For bulk data applications such as file serving this trick
> works like a charm.  But other applications do truly random access
> and never come back for the prefetched blocks; in this case both
> disk bandwidth and cache space are wasted.  An advanced
> implementation can try to distinguish sequential and random access
> patterns, but in reality that appears to be a challenging task.

Ah yes, thanks for the correction; I now remember reading about that
before. Makes the name 'prefetch' that much more fitting, doesn't it?

And as you say, it's not that useful a feature with random access
(I hadn't thought about that); in fact, I can imagine that it might
delay moving the disk heads to the next desired (random) position as
the FS is still requesting data that it isn't going to be needing
(except in some lucky cases) - unless it manages to detect the
randomness of the access patterns. You can't predict randomness from
read requests alone, of course; you don't know about the requests that
are still to come. You can, however, assume something like that is the
case if historic requests turned out to be random in nature, but then
you'd want to know for which area of the FS this is the case.

I don't know how you partitioned your zpools, but to me it seems like
it'd be preferable to have the PostgreSQL tablespaces (and possibly
other data that's likely to be accessed randomly) in a separate zpool
from the rest of the system so you can restrict disabling prefetch to
just that file-system. You probably already did that...

It could be interesting to see how clustering the relevant tables
would affect the prefetch performance, I'd expect disk access to be
less random that way. It's probably still better to disable prefetch
though.
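
(For example -- table and index names hypothetical; keep in mind that
CLUSTER takes an exclusive lock and rewrites the table, and the
physical ordering decays as new rows arrive:)

    -- physically reorder a table along one of its indexes (8.3+ syntax)
    CLUSTER messages USING messages_mailbox_id_idx;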

>> ZFS uses quite a bit of memory, so if you dedicated all your
>> memory to just postgres and the disk cache, then you didn't
>> leave enough space for the prefetch data and _something_ will be
>> moved to swap.
>
> I hope you know that FreeBSD is exceptionally good at distributing
> available memory between its consumers.  That said, useless prefetch
> indeed puts extra pressure on disk cache and results in unnecessary
> cache evictions, thus making things even worse.  It is true that ZFS
> is memory hungry and so rather sensitive to non-optimal memory use
> patterns.  Useless prefetch wastes memory that could be used to
> speed up other ZFS operations.

Yes, I do know that; it's one of the reasons I prefer it over other
OSes. The keyword here was 'available memory', though, under the
assumption that something was hitting swap. But apparently that wasn't
the case.

>> You'll probably want to ask about this on the FreeBSD mailing lists
>> as well, they'll know much better than I do ;)
>
> Are you a local FreeBSD expert? ;-)  Jokes aside, I don't think this
> topic has to do with FreeBSD as such; it is mostly about making the
> advanced technologies of PostgreSQL and ZFS go well together.  Even
> ZFS developers admit that database applications may call for
> exceptions to general ZFS practices and rules.

I wouldn't call myself an expert; I just use it on a few systems at
home and am more a user than an administrator. I do read the
stable/current mailing lists though (since 2004, according to my mail
client) and keep an eye on (among others) the ZFS discussions, as I
feel tempted to change my gmirrors into zpools some day. It certainly
looks like an interesting FS, very flexible and reliable.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


Re: ZFS prefetch considered evil?

From: John R Pierce
Date:
Alban Hertroys wrote:
> I don't know how you partitioned your zpools, but to me it seems like
> it'd be preferable to have the PostgreSQL tablespaces (and possibly
> other data that's likely to be accessed randomly) in a separate zpool
> from the rest of the system so you can restrict disabling prefetch to
> just that file-system. You probably already did that...
>
> It could be interesting to see how clustering the relevant tables
> would affect the prefetch performance, I'd expect disk access to be
> less random that way. It's probably still better to disable prefetch
> though.

In fact, somewhere in Sun.com land there's an app note that suggests
creating TWO ZFS mount points for Postgres: one for the $PGDATA
directory, which uses 128k blocks, and another for a tablespace that
you put all your regular databases in, which uses 8k blocks.  The idea
is that the WAL logging is relatively sequential and takes place in
the 128k-block ZFS, while access to the actual database table files is
far more random.

These two ZFS filesystems can be made in the same zpool; the normal
recommendation is to have one large non-root zpool mirror for all your
data (and another, smaller zpool mirror for your OS, at least assuming
you have more than two physical disk drives).
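
(A sketch of that layout -- pool, device and dataset names all
hypothetical:)

    zpool create data mirror da1 da2            # one big mirrored data pool
    zfs create -o recordsize=128k data/pgdata   # $PGDATA incl. WAL: mostly sequential
    zfs create -o recordsize=8k data/pgtables   # tablespace for tables: random 8k pages

(After which you'd point a PostgreSQL tablespace at /data/pgtables and
keep everything else under /data/pgdata.)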