Thread: Shared buffers, db transactions committed, and write IO on Solaris

Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones
Greetings,

We've recently made a couple of changes to our system that have resulted in a drastic increase in performance as well as some very confusing changes to the database statistics, specifically pg_stat_database.xact_commit.  Here are the details:

OS: Solaris10 x86
Server: Sunfire X4100, 8GB Memory, 2 Dual Core Opterons
Postgres 8.2.3
Disk array: 
Sun STK 6130 + CSM100 SATA tray, dual channel MPXIO, 15k drives, RAID5 across 14 disks
WAL logs on SATA RAID10
SAN architecture, 2 brocade FABRIC switches

The changes we made were:

Increased shared buffers from 150000 to 200000
Set the disk mount for the data directory to use forcedirectio (added that mount option to the /etc/vfstab entry (ufs fs))
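
For reference, the relevant entries boiled down to something like the following (device names and mount point here are placeholders, not our actual ones):

    # postgresql.conf -- shared_buffers is a count of 8K buffers in 8.2, so 200000 is roughly 1.6GB
    shared_buffers = 200000

    # /etc/vfstab -- data directory filesystem mounted with forcedirectio
    /dev/dsk/c1t0d0s6   /dev/rdsk/c1t0d0s6   /pgdata   ufs   2   yes   forcedirectio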

So, the reason we did this was that for months now we'd been experiencing extremely high IO load from both the perspective of the OS and the database, specifically where writes were concerned.  During peak hours it wasn't unheard of for pg_stat_database to report anywhere from 500000 to 1000000 transactions committed in an hour.  iostat's %b (disk busy) sat at 100% for longer than we'd care to think about with the wait percentage going from a few percent on up to 50% at times and the cpu load almost never rising above an average of about 2, i.e. we were extremely IO bound in all cases.
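
(Those hourly figures come from differencing samples of the cumulative counters in pg_stat_database taken an hour apart, i.e. something along these lines:)

    # cumulative commit/rollback counters since the last stats reset; sample twice and subtract
    psql -c "SELECT datname, xact_commit, xact_rollback FROM pg_stat_database WHERE datname = current_database();"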

As soon as we restarted postgres after making those changes the IO load was gone.  While the number and amount of disk reads have stayed pretty much the same and the number of disk writes has stayed the same, the amount of data being written has dropped by about a factor of 10, which is huge.  The cpu load shot way up to around a 20 avg. and stayed that way for about two days (we're thinking that was autovacuum "catching up").  In addition, and this is the truly confusing part, the xact_commit and xact_rollback stats from pg_stat_database both dropped by an order of magnitude (another factor of 10).  So, we are now doing 50000 to 100000 commits per hour during peak hours.

So, where were all of those extra transactions coming from?  Are the transactions reported in pg_stat_database anything other than SQL statements?  What was causing all of the excess(?!) data being written to the disk (it seems that there's a 1:1 correspondence between the xacts and the volume of data being written)?  Given that we have the bgwriter on, could it have been the culprit, and could one of the changes have allowed it to now operate more efficiently and/or correctly?

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Tom Lane
Erik Jones <erik@myemma.com> writes:
> We've recently made a couple changes to our system that have resulted
> in a drastic increase in performance as well as some very confusing
> changes to the database statistics, specifically
> pg_stat_database.xact_commit.  Here's the details:

I'm kinda boggled too.  I can see how increasing shared buffers could
result in a drastic reduction in write rate, if the working set of your
queries fits in the new space but didn't fit in the old.  I have no idea
how that leads to a drop in number of transactions committed though.
It doesn't make sense that autovac would run less frequently, because
it's driven by number of tuples changed not number of disk writes; and
that could hardly account for a 10x drop anyway.

Did you by any chance take note of exactly which processes were
generating all the I/O or the CPU load?

            regards, tom lane

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones
On Mar 29, 2007, at 11:16 AM, Tom Lane wrote:

Erik Jones <erik@myemma.com> writes:
We've recently made a couple changes to our system that have resulted  
in a drastic increase in performance as well as some very confusing  
changes to the database statistics, specifically  
pg_stat_database.xact_commit.  Here's the details:

I'm kinda boggled too.  I can see how increasing shared buffers could
result in a drastic reduction in write rate, if the working set of your
queries fits in the new space but didn't fit in the old.  I have no idea
how that leads to a drop in number of transactions committed though.
It doesn't make sense that autovac would run less frequently, because
it's driven by number of tuples changed not number of disk writes; and
that could hardly account for a 10x drop anyway.

Did you by any chance take note of exactly which processes were
generating all the I/O or the CPU load?

Well, wrt the CPU load, as I said, we're pretty sure that's autovac, as we still get spikes that hit about the same threshold, after which cache hits go up dramatically and the spikes just don't last two days anymore.

As far as the procs responsible for the writes go, we were unable to see that from the OS level as the guy we had as a systems admin last year totally screwed us with the way he set up the SunCluster on the boxes and we have been unable to run DTrace, which has left us watching a lot of iostat.  However, we did notice a direct correlation between write spikes and "write intensive" queries like large COPYs, UPDATEs, and INSERTs.
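
(For the record, the %b and wait numbers mentioned earlier are the kind you get from iostat's extended device statistics, something like:)

    # extended device statistics every 5 seconds, skipping idle devices; %w is wait, %b is busy
    iostat -xnz 5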

One very important thing to note here is that the number, or rather rate, of disk writes has not changed.  It's the volume of data in those writes that has dropped, along with those mysterious transaction counts.  Could the bgwriter be the culprit here?  Does anything it does get logged as a transaction?

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: "dimitri k"
Erik,

using 'forcedirectio' simply brings your write operations down to the
*real* volume - meaning when you need to write 10 bytes you'll write 10
bytes (instead of the UFS block size (8K)). So that explains to me why your
write volume became slower.

Now, why TX number is reduced - is a small mystery :)

Options:
   - you really do 10 times less commits, means you work 10 times slower? ;)
     what about users? how do you measure your work performance?

   - TX reported in pg_* tables are not exact, but I don't believe at all :)

Rgds,
-Dimitri

On 3/29/07, Erik Jones <erik@myemma.com> wrote:
> On Mar 29, 2007, at 11:16 AM, Tom Lane wrote:
>
> > Erik Jones <erik@myemma.com> writes:
> >> We've recently made a couple changes to our system that have resulted
> >> in a drastic increase in performance as well as some very confusing
> >> changes to the database statistics, specifically
> >> pg_stat_database.xact_commit.  Here's the details:
> >
> > I'm kinda boggled too.  I can see how increasing shared buffers could
> > result in a drastic reduction in write rate, if the working set of
> > your
> > queries fits in the new space but didn't fit in the old.  I have no
> > idea
> > how that leads to a drop in number of transactions committed though.
> > It doesn't make sense that autovac would run less frequently, because
> > it's driven by number of tuples changed not number of disk writes; and
> > that could hardly account for a 10x drop anyway.
> >
> > Did you by any chance take note of exactly which processes were
> > generating all the I/O or the CPU load?
>
> Well, wrt to the CPU load, as I said, we're pretty sure that's
> autovac as we still get spikes that hit about the same threshold,
> after which cache hits go up dramatically and the spikes just don't
> last two days anymore.
>
> As far as the procs responsible for the writes go, we were unable to
> see that from the OS level as the guy we had as a systems admin last
> year totally screwed us with the way he set up the SunCluster on the
> boxes and we have been unable to run Dtrace which has left us
> watching a lot of iostat.  However, we did notice a direct
> correlation between write spikes and "write intensive" queries like
> large COPYs, UPDATEs, and INSERTs.
>
> One very important thing to note here is that the number, or rather
> rate, of disk writes has not changed.  It's the volume of data in
> those writes that has dropped, along with those transaction
> mysterious counts.  Could the bgwriter be the culprit here?  Does
> anything it does get logged as a transaction?
>
> erik jones <erik@myemma.com>
> software developer
> 615-296-0838
> emma(r)
>
>
>
>

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones
On Mar 29, 2007, at 12:41 PM, dimitri k wrote:

On 3/29/07, Erik Jones <erik@myemma.com> wrote:
On Mar 29, 2007, at 11:16 AM, Tom Lane wrote:

> Erik Jones <erik@myemma.com> writes:
>> We've recently made a couple changes to our system that have resulted
>> in a drastic increase in performance as well as some very confusing
>> changes to the database statistics, specifically
>> pg_stat_database.xact_commit.  Here's the details:
>
> I'm kinda boggled too.  I can see how increasing shared buffers could
> result in a drastic reduction in write rate, if the working set of
> your
> queries fits in the new space but didn't fit in the old.  I have no
> idea
> how that leads to a drop in number of transactions committed though.
> It doesn't make sense that autovac would run less frequently, because
> it's driven by number of tuples changed not number of disk writes; and
> that could hardly account for a 10x drop anyway.
>
> Did you by any chance take note of exactly which processes were
> generating all the I/O or the CPU load?

Well, wrt to the CPU load, as I said, we're pretty sure that's
autovac as we still get spikes that hit about the same threshold,
after which cache hits go up dramatically and the spikes just don't
last two days anymore.

As far as the procs responsible for the writes go, we were unable to
see that from the OS level as the guy we had as a systems admin last
year totally screwed us with the way he set up the SunCluster on the
boxes and we have been unable to run Dtrace which has left us
watching a lot of iostat.  However, we did notice a direct
correlation between write spikes and "write intensive" queries like
large COPYs, UPDATEs, and INSERTs.

One very important thing to note here is that the number, or rather
rate, of disk writes has not changed.  It's the volume of data in
those writes that has dropped, along with those transaction
mysterious counts.  Could the bgwriter be the culprit here?  Does
anything it does get logged as a transaction?

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)




Erik,

using 'forcedirectio' simply brings your write operations to the
*real* volume - means while you need to write 10 bytes you'll write 10
bytes (instead of UFS block size (8K)). So it explains me why your
write volume became slower.

Sorry, that's not true.  Google "ufs forcedirectio", go to the first link, and you will find:

"forcedirectio

The forcedirectio (read "force direct IO") UFS option causes data to be buffered in kernel address whenever data is transferred between user address space and the disk. In other words, it bypasses the file system cache. For certain types of applications -- primarily database systems -- this option can dramatically improve performance. In fact, some database experts have argued that a file using the forcedirectio option will outperform a raw partition, though this opinion seems fairly controversial.

The forcedirectio improves file system performance by eliminating double buffering, providing a small, efficient code path for file system reads and writes and removing pressure on memory."

However, what this does mean is that writes will be at the actual filesystem block size and not the cache block size (8K v. 512K).
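
(It's easy to confirm which mode a given filesystem is actually running in, since the active mount options show up in the mount table; the mount point below is illustrative:)

    # forcedirectio will be listed among the options if it's in effect
    mount -v | grep /pgdata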


Now, why TX number is reduced - is a small mystery :)

Options:
  - you really do 10 times less commits, means you work 10 times slower? ;)
    what about users? how do you measure your work performance?

We are an email marketing service provider with a web front end application.  We measure work performance via web requests (counts, types, etc...), mailer activity and the resulting database activity.  We are doing as much or more work now than previously, and faster.


  - TX reported in pg_* tables are not exact, but I don't believe at all :)

Even if they aren't exact, being off by a factor of 10 wouldn't be believable.  The forcedirectio mount setting for ufs can definitely explain the drop in data written volume, but doesn't do much to explain the difference in xact commits.

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Tom Lane
Erik Jones <erik@myemma.com> writes:
> One very important thing to note here is that the number, or rather
> rate, of disk writes has not changed.  It's the volume of data in
> those writes that has dropped, along with those transaction
> mysterious counts.

Hmm.  I'm suddenly thinking about the stats collector: in existing 8.2.x
releases it's got a bug that causes it to write the collected-stats file
much too often.  If you had done something that would shrink the size
of the stats file, that might explain this observation.  Do you have
stats_reset_on_server_start turned on?

The drop in reported transaction rate is still baffling though.  Are you
sure you're really doing the same amount of work?  Can you estimate what
you think the transaction rate *should* be from a what-are-your-clients-
doing perspective?

            regards, tom lane

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones
On Mar 29, 2007, at 2:19 PM, Tom Lane wrote:

Erik Jones <erik@myemma.com> writes:
One very important thing to note here is that the number, or rather  
rate, of disk writes has not changed.  It's the volume of data in  
those writes that has dropped, along with those transaction  
mysterious counts.

Hmm.  I'm suddenly thinking about the stats collector: in existing 8.2.x
releases it's got a bug that causes it to write the collected-stats file
much too often.  If you had done something that would shrink the size
of the stats file, that might explain this observation.  Do you have 
stats_reset_on_server_start turned on?

Nope.


The drop in reported transaction rate is still baffling though.  Are you
sure you're really doing the same amount of work?  Can you estimate what
you think the transaction rate *should* be from a what-are-your-clients-
doing perspective?

Unfortunately, I can't think of any way to do that.  Our app is made up of a lot of different components and not all of them are even directly client driven.  For the client driven portions of the app any given web request can contain anywhere from around 10 to sometimes over 50 different xacts (and that's just a guesstimate).  Also, we didn't start tracking xact counts via pg_stat_database until about two months ago when we were in IO bound hell, and we actually thought that the really big xact #s were normal for our app as those were the first and, thus, only numbers we had to work with.

Also, another metric we track is to take a count from pg_stat_activity of queries running longer than 1 second every five minutes.  Before these recent changes it wasn't uncommon to see that count seriously stack up to over 200 at times, with write intensive queries hanging out for sometimes 30 minutes or more (we'd often end up having to kill them...).  Since we upped the shared buffers and turned on forcedirectio for our fs mount, that number has stayed under 50 and has only crossed 20 once.
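
(That metric is just a periodic count along these lines; the exact query here is illustrative, using 8.2's pg_stat_activity columns:)

    # non-idle backends whose current query has been running for more than 1 second
    psql -c "SELECT count(*) FROM pg_stat_activity WHERE current_query NOT LIKE '<IDLE>%' AND now() - query_start > interval '1 second';"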

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Dimitri
> >>
> > Erik,
> >
> > using 'forcedirectio' simply brings your write operations to the
> > *real* volume - means while you need to write 10 bytes you'll write 10
> > bytes (instead of UFS block size (8K)). So it explains me why your
> > write volume became slower.

I meant 'lower' (not slower)

>
> Sorry, that's not true.  Google "ufs forcedirectio" go to the first
> link and you will find:
>
> "forcedirectio
>
> The forcedirectio (read "force direct IO") UFS option causes data to
> be buffered in kernel address whenever data is transferred between
> user address space and the disk. In other words, it bypasses the file
> system cache. For certain types of applications -- primarily database
> systems -- this option can dramatically improve performance. In fact,
> some database experts have argued that a file using the forcedirectio
> option will outperform a raw partition, though this opinion seems
> fairly controversial.
>
> The forcedirectio improves file system performance by eliminating
> double buffering, providing a small, efficient code path for file
> system reads and writes and removing pressure on memory."

Erik, please, don't take me wrong, but reading Google (or better man pages)
don't replace brain and basic practice... Direct IO option is not a silver
bullet which will solve all your problems (try to do 'cp' on the mounted in
'forcedirectio' filesystem, or use your mailbox on it - you'll quickly
understand impact)...

>
> However, what this does mean is that writes will be at the actual
> filesystem block size and not the cache block size (8K v. 512K).

while UFS filesystem mounted normally, it uses its own cache for all
operations (read and write) and saves data modifications on per
page basis, means: when a process writes 200 bytes there will be 200
bytes modified in cache, then whole page is written (8K) once data
demanded to be flushed (and WAL is writing per each commit)...

Now, mounted with 'forcedirectio' option UFS is free of page size constraint
and will write like a raw device an exactly demanded amount of data, means:
when a process writes 200 bytes it'll write exactly 200 bytes to the disk. For
WAL it may be very benefit, because you'll be able to perform more I/O
operations/sec, means more commit/sec. But on the same time it may
dramatically slow down all your read operations (no more data prefetch
nor dynamic cache)... The best solution probably is to separate WAL from
data (BTW, it'll be nice to have such an option as WAL_PATH in conf file),
it may be resolved by simple use of tablespace or at least directory links, etc.
But if your major activity is writing - probably it's already ok for you.

However, to understand TX number mystery I think the only possible solution
is to reproduce a small live test:

(I'm sure you're aware you can mount/unmount forcedirectio dynamically?)

during stable workload do:

   # mount -o remount,logging  /path_to_your_filesystem

and check if I/O volume is increasing as well as TX numbers,
then come back:

   # mount -o remount,forcedirectio  /path_to_your_filesystem

and see if I/O volume is decreasing as well as TX numbers...

Best regards!
-Dimitri


>
> >
> > Now, why TX number is reduced - is a small mystery :)
> >
> > Options:
> >   - you really do 10 times less commits, means you work 10 times
> > slower? ;)
> >     what about users? how do you measure your work performance?
>
> We are an email marketing service provider with a web front end
> application.  We measure work performance via web requests (counts,
> types, etc...), mailer activity and the resulting database activity.
> We are doing as much or more work now than previously, and faster.
>
> >
> >   - TX reported in pg_* tables are not exact, but I don't believe
> > at all :)
>
> Even if they aren't exact, being off by a factor of 10 wouldn't be
> believable.  the forcedirectio mount setting for ufs can definitely
> explain the drop in data written volume, but doesn't do much to
> explain the difference in xact commits.
>
> erik jones <erik@myemma.com>
> software developer
> 615-296-0838
> emma(r)
>
>
>
>

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Josh Berkus
Erik,

Wow, thanks for the post.

We've just started testing the option of sizing shared_buffers bigger than
the database, and using forcedirectio in benchmarks at Sun.  So far, our
experience has been *equal* performance in that configuration, so it's
*very* interesting to see you're getting a gain.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones

On Mar 29, 2007, at 7:23 PM, Josh Berkus wrote:

Erik,

Wow, thanks for the post.

We've just started testing the option of sizing shared_buffers bigger than 
the database, and using forcedirectio in benchmarks at Sun.  So far, our 
experience has been *equal* performance in that configuration, so it's 
*very* interesting to see you're getting a gain.

-- 
--Josh

Josh,

You're welcome!  However, I believe our situation is very different from what you're testing, if I understand you correctly.  Are you saying that your entire database will fit in memory?  If so, then these are very different situations as there is no way ours could ever do that.  In fact, I'm not sure that forcedirectio would really net you any gain in that situation, as the IO service time will be basically nil if the filesystem cache doesn't have to page, which I would think is why you're seeing what you are.

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Greg Smith
On Thu, 29 Mar 2007, Erik Jones wrote:

> As far as the procs responsible for the writes go, we were unable to see that
> from the OS level as the guy we had as a systems admin last year totally
> screwed us with the way he set up the SunCluster on the boxes and we have
> been unable to run Dtrace which has left us watching a lot of iostat.

There are two processes spawned by Postgres that handle collecting
statistics and doing the background writing.  You don't need any fancy
tools (you Solaris guys and your Dtrace, sheesh) to see if they're busy.
Just run top and switch the display to show the full command line instead
of just the process name (on Linux the 'c' key does this) and you'll see
the processes that had just been "postgres" before label themselves.
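
(On Solaris, if the stock ps doesn't show the updated titles, the BSD-flavored one should; something like this is enough to spot them:)

    # the relevant backends retitle themselves "stats collector process" and "writer process"
    /usr/ucb/ps auxww | egrep 'stats collector|writer process'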

The system I saw get nailed by the bug Tom mentioned was also doing an
update-heavy workload.  It manifested itself as one CPU spending almost
all its time running the statistics collector.  That process's issues
kicked up I/O waits from minimal to >25% and the background writer was
incredibly sluggish as well.  The problem behavior was intermittent in
that it would be crippling at times, but merely a moderate slowdown at
others.  Your case sounds similar in several ways.

If you see the stats collector process taking up any significant amount of
CPU time in top, you should strongly consider the possibility that you're
suffering from this bug.  It's only a few characters to patch the bug if
you don't want to wait for the next packaged release.  In your situation,
I'd do it just to eliminate this problem from your list of possible causes
ASAP.

To fix, edit src/backend/postmaster/pgstat.c
Around line 1650 you'll find:

   write_timeout.it_value.tv_usec = PGSTAT_STAT_INTERVAL % 1000;

Change it to match the current line in the CVS tree for 8.3:

   write_timeout.it_value.tv_usec = (PGSTAT_STAT_INTERVAL % 1000) * 1000;

(The difference is the units: PGSTAT_STAT_INTERVAL is 500, in milliseconds, so without the * 1000 the timer fires after 500 microseconds instead of 500 milliseconds and the stats file gets rewritten almost continuously.)  That's all it took to resolve things for me.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones
On Mar 29, 2007, at 5:15 PM, Dimitri wrote:

>>
> Erik,
>
> using 'forcedirectio' simply brings your write operations to the
> *real* volume - means while you need to write 10 bytes you'll write 10
> bytes (instead of UFS block size (8K)). So it explains me why your
> write volume became slower.

I men 'lower' (not slower)


Sorry, that's not true.  Google "ufs forcedirectio" go to the first
link and you will find:

"forcedirectio

The forcedirectio (read "force direct IO") UFS option causes data to
be buffered in kernel address whenever data is transferred between
user address space and the disk. In other words, it bypasses the file
system cache. For certain types of applications -- primarily database
systems -- this option can dramatically improve performance. In fact,
some database experts have argued that a file using the forcedirectio
option will outperform a raw partition, though this opinion seems
fairly controversial.

The forcedirectio improves file system performance by eliminating
double buffering, providing a small, efficient code path for file
system reads and writes and removing pressure on memory."

Erik, please, don't take me wrong, but reading Google (or better man pages)
don't replace brain and basic practice... Direct IO option is not a silver
bullet which will solve all your problems (try to do 'cp' on the mounted in
'forcedirectio' filesystem, or use your mailbox on it - you'll quickly
understand impact)...


However, what this does mean is that writes will be at the actual
filesystem block size and not the cache block size (8K v. 512K).

while UFS filesystem mounted normally, it uses its own cache for all
operations (read and write) and saves data modifications on per
page basis, means: when a process writes 200 bytes there will be 200
bytes modified in cache, then whole page is written (8K) once data
demanded to be flushed (and WAL is writing per each commit)...

Now, mounted with 'forcedirectio' option UFS is free of page size constraint
and will write like a raw device an exactly demanded amount of data, means:
when a process writes 200 bytes it'll write exactly 200 bytes to the disk.

You are right in that the page size constraint is lifted in that directio cuts out the VM filesystem cache.  However, the Solaris kernel still issues io ops in terms of its logical block size (which we have at the default 8K).  It can issue io ops for fragments as small as 1/8th of the block size, but Postgres issues its io requests in terms of the block size which means that io ops from Postgres will be in 8K chunks which is exactly what we see when we look at our system io stats.  In fact, if any io request is made that isn't a multiple of 512 bytes (the disk sector size), the file system switches back to the buffered io.
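
(The logical block size and fragment size in question can be checked per filesystem; the mount point here is illustrative:)

    # statvfs view of the filesystem; look at the "block size" and "frag size" fields
    df -g /pgdata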


However, to understand TX number mystery I think the only possible solution
is to reproduce a small live test:

(I'm sure you're aware you can mount/unmount forcedirectio dynamically?)

during stable workload do:

  # mount -o remount,logging  /path_to_your_filesystem

and check if I/O volume is increasing as well TX numbers
than come back:

  # mount -o remount,forcedirectio  /path_to_your_filesystem

and see if I/O volume is decreasing as well TX numbers...

That's an excellent idea and I'll run it by the rest of our team tomorrow.

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Dimitri
>
> You are right in that the page size constraint is lifted in that
> directio cuts out the VM filesystem cache.  However, the Solaris
> kernel still issues io ops in terms of its logical block size (which
> we have at the default 8K).  It can issue io ops for fragments as
> small as 1/8th of the block size, but Postgres issues its io requests
> in terms of the block size which means that io ops from Postgres will
> be in 8K chunks which is exactly what we see when we look at our
> system io stats.  In fact, if any io request is made that isn't a
> multiple of 512 bytes (the disk sector size), the file system
> switches back to the buffered io.

Oh, yes, of course! yes, you still need to respect multiple of 512
bytes block size on read and write - sorry, I was tired :)

Then it's seems to be true - default XLOG block size is 8K, means for
every even small auto-committed transaction we should write 8K?... Is
there any reason to use so big default block size?...

Probably it may be a good idea to put it as 'initdb' parameter? and
have such value per database server?

Rgds,
-Dimitri

>
> >
> > However, to understand TX number mystery I think the only possible
> > solution
> > is to reproduce a small live test:
> >
> > (I'm sure you're aware you can mount/unmount forcedirectio
> > dynamically?)
> >
> > during stable workload do:
> >
> >   # mount -o remount,logging  /path_to_your_filesystem
> >
> > and check if I/O volume is increasing as well TX numbers
> > than come back:
> >
> >   # mount -o remount,forcedirectio  /path_to_your_filesystem
> >
> > and see if I/O volume is decreasing as well TX numbers...
>
> That's an excellent idea and I'll run it by the rest of our team
> tomorrow.
>
> erik jones <erik@myemma.com>
> software developer
> 615-296-0838
> emma(r)
>
>
>
>

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones

On Mar 30, 2007, at 8:14 AM, Dimitri wrote:


You are right in that the page size constraint is lifted in that
directio cuts out the VM filesystem cache.  However, the Solaris
kernel still issues io ops in terms of its logical block size (which
we have at the default 8K).  It can issue io ops for fragments as
small as 1/8th of the block size, but Postgres issues its io requests
in terms of the block size which means that io ops from Postgres will
be in 8K chunks which is exactly what we see when we look at our
system io stats.  In fact, if any io request is made that isn't a
multiple of 512 bytes (the disk sector size), the file system
switches back to the buffered io.

Oh, yes, of course! yes, you still need to respect multiple of 512
bytes block size on read and write - sorry, I was tired :)

Then it's seems to be true - default XLOG block size is 8K, means for
every even small auto-committed transaction we should write 8K?... Is
there any reason to use so big default block size?...

Probably it may be a good idea to put it as 'initdb' parameter? and
have such value per database server?

I believe it's because that is a pretty normal Unix kernel block size and you want the two to match.

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Dimitri
The problem is while your goal is to commit as fast as possible - it's
pity to vast I/O operation speed just keeping common block size...
Let's say if your transaction modification entering into 512K - you'll
be able to write much more 512K blocks per second rather 8K per second
(for the same amount of data)... Even we rewrite probably several
times the same block with incoming transactions - it still costs on
traffic, and we will process slower even H/W can do better. Don't
think it's good, no? ;)

Rgds,
-Dimitri

On 3/30/07, Erik Jones <erik@myemma.com> wrote:
>
> On Mar 30, 2007, at 8:14 AM, Dimitri wrote:
>
> >>
> >> You are right in that the page size constraint is lifted in that
> >> directio cuts out the VM filesystem cache.  However, the Solaris
> >> kernel still issues io ops in terms of its logical block size (which
> >> we have at the default 8K).  It can issue io ops for fragments as
> >> small as 1/8th of the block size, but Postgres issues its io requests
> >> in terms of the block size which means that io ops from Postgres will
> >> be in 8K chunks which is exactly what we see when we look at our
> >> system io stats.  In fact, if any io request is made that isn't a
> >> multiple of 512 bytes (the disk sector size), the file system
> >> switches back to the buffered io.
> >
> > Oh, yes, of course! yes, you still need to respect multiple of 512
> > bytes block size on read and write - sorry, I was tired :)
> >
> > Then it's seems to be true - default XLOG block size is 8K, means for
> > every even small auto-committed transaction we should write 8K?... Is
> > there any reason to use so big default block size?...
> >
> > Probably it may be a good idea to put it as 'initdb' parameter? and
> > have such value per database server?
>
> I believe it's because that is a pretty normal Unix kernal block size
> and you want the two to match.
>
> erik jones <erik@myemma.com>
> software developer
> 615-296-0838
> emma(r)
>
>
>
>

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones

On Mar 30, 2007, at 10:05 AM, Kenneth Marshall wrote:

On Fri, Mar 30, 2007 at 04:25:16PM +0200, Dimitri wrote:
The problem is while your goal is to commit as fast as possible - it's
pity to vast I/O operation speed just keeping common block size...
Let's say if your transaction modification entering into 512K - you'll
be able to write much more 512K blocks per second rather 8K per second
(for the same amount of data)... Even we rewrite probably several
times the same block with incoming transactions - it still costs on
traffic, and we will process slower even H/W can do better. Don't
think it's good, no? ;)

Rgds,
-Dimitri

With block sizes you are always trading off overhead versus space
efficiency. Most OS write only in 4k/8k to the underlying hardware
regardless of the size of the write you issue. Issuing 16 512byte
writes has much more overhead than 1 8k write. On the light transaction
end, there is no real benefit to a small write and it will slow
performance for high throughput environments. It would be better to,
and I think that someone is looking into, batching I/O.

Ken

True, and really, considering that data is only written to disk by the bgwriter and at checkpoints, writes are already somewhat batched.  Also, Dimitri, I feel I should backtrack a little and point out that it is possible to have postgres write in 512-byte blocks (at least for UFS, which is what's in my head right now) if you set the system's logical block size to 4K and fragment size to 512 bytes and then set postgres's BLCKSZ to 512 bytes.  However, as Ken has just pointed out, what you gain in space efficiency you lose in performance, so if you're working with a high traffic database this wouldn't be a good idea.
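
(For completeness, that UFS geometry would be set at newfs time, roughly as below -- illustrative device, and, as said, not something I'd recommend for a busy database:)

    # 4K logical blocks with 512-byte fragments
    newfs -b 4096 -f 512 /dev/rdsk/c1t0d0s6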

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Josh Berkus
Erik,

> You'er welcome!  However, I believe our situation is very different
> from what you're testing if I understand you correctly.  Are you
> saying that you're entire database will fit in memory?  If so, then
> these are very different situations as there is no way ours could
> ever do that.  In fact, I'm not sure that forcedirectio would really
> net you any gain in that situation as the IO service time will be
> basically nil if the filesystem cache doesn't have to page which I
> would think is why your seeing what you are.

Even more interesting.  I guess we've been doing too much work with
benchmark workloads, which tend to be smaller databases.

Thing is, there's *always* I/O for a read/write database.  If nothing else,
updates have to be synched to disk.

Anyway ... regarding the mystery transactions ... are you certain that it's
not your application?  I can imagine that, if your app has a fairly tight
retry interval for database non-response, that I/O sluggishness could
result in commit attempts spinning out of control.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones
On Mar 30, 2007, at 4:46 PM, Josh Berkus wrote:

Erik,

You'er welcome!  However, I believe our situation is very different
from what you're testing if I understand you correctly.  Are you
saying that you're entire database will fit in memory?  If so, then
these are very different situations as there is no way ours could
ever do that.  In fact, I'm not sure that forcedirectio would really
net you any gain in that situation as the IO service time will be
basically nil if the filesystem cache doesn't have to page which I
would think is why your seeing what you are.

Even more interesting.  I guess we've been doing too much work with 
benchmark workloads, which tend to be smaller databases.  

Thing is, there's *always* I/O for a read/write database.  If nothing else, 
updates have to be synched to disk.

Right.  But, how *much* I/O?


Anyway ... regarding the mystery transactions ... are you certain that it's 
not your application?  I can imagine that, if your app has a fairly tight 
retry interval for database non-response, that I/O sluggishness could 
result in commit attempts spinning out of control.

Well, our application code itself doesn't retry queries if the db is taking a long time to respond.  However, we do have a number of our servers making db connections via pgpool so you may be on to something here.  While I will be taking these questions to the pgpool lists, I'll posit them here as well:  If a pgpool child process reaches its connection lifetime while waiting on a query to complete, does pgpool retry the query with another child?  If a connection thus dies, does the transaction complete normally on the server?  If the answers to these questions are both yes, this could definitely be what was happening.
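
(If memory serves, the pgpool.conf settings governing that lifetime are child_life_time and connection_life_time; the path below is illustrative:)

    # check what the pool's lifetime knobs are set to
    egrep 'child_life_time|connection_life_time' /usr/local/etc/pgpool.conf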

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Josh Berkus
Erik,

> Well, our application code itself doesn't retry queries if the db is
> taking a long time to respond.  However, we do have a number of our
> servers making db connections via pgpool so you may be on to
> something here.  While I will be taking these questions to the pgpool
> lists, I'll posit them here as well:  If a pgpool child process
> reaches it's connection lifetime while waiting on a query to
> complete, does pgpool retry the query with another child?  If a
> connection thus dies, does the transaction complete normally on the
> server?  If the answers to these questions are both yes, this could
> definitely be what was happening.

It's been a while since I used pgpool with load balancing turned on, so you
should probably try the pgpool lists.  What version?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Dimitri
Folks,

to close topic with "LOG block size=1K" idea - I took a time to test
it (yes) and in best cases there is only 15% gain comparing to 8K -
storage protocol is quite heavy itself, so less or more data sent
within it doesn't reduce service time too much... As well even this
gain is quickly decreasing with growing workload! So, yes 8K is good
enough and probably the most optimal choice for LOG (as well data)
block size.

Rgds,
-Dimitri


> Well, to check if there is a real potential gain all we need is a
> small comparing test using PgSQL compiled with LOG block size equal to
> say 1K and direct IO enabled.
>
> Rgds,
> -Dimitri
>
>
> On 3/30/07, Kenneth Marshall <ktm@rice.edu> wrote:
> > On Fri, Mar 30, 2007 at 04:25:16PM +0200, Dimitri wrote:
> > > The problem is while your goal is to commit as fast as possible - it's
> > > pity to vast I/O operation speed just keeping common block size...
> > > Let's say if your transaction modification entering into 512K - you'll
> > > be able to write much more 512K blocks per second rather 8K per second
> > > (for the same amount of data)... Even we rewrite probably several
> > > times the same block with incoming transactions - it still costs on
> > > traffic, and we will process slower even H/W can do better. Don't
> > > think it's good, no? ;)
> > >
> > > Rgds,
> > > -Dimitri
> > >
> > With block sizes you are always trading off overhead versus space
> > efficiency. Most OS write only in 4k/8k to the underlying hardware
> > regardless of the size of the write you issue. Issuing 16 512byte
> > writes has much more overhead than 1 8k write. On the light transaction
> > end, there is no real benefit to a small write and it will slow
> > performance for high throughput environments. It would be better to,
> > and I think that someone is looking into, batching I/O.
> >
> > Ken
> >
>

Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones

On Apr 3, 2007, at 11:51 AM, Dimitri wrote:

Well, to check if there is a real potential gain all we need is a
small comparing test using PgSQL compiled with LOG block size equal to
say 1K and direct IO enabled.

Rgds,
-Dimitri


On 3/30/07, Kenneth Marshall <ktm@rice.edu> wrote:
> On Fri, Mar 30, 2007 at 04:25:16PM +0200, Dimitri wrote:
> > The problem is while your goal is to commit as fast as possible - it's
> > pity to vast I/O operation speed just keeping common block size...
> > Let's say if your transaction modification entering into 512K - you'll
> > be able to write much more 512K blocks per second rather 8K per second
> > (for the same amount of data)... Even we rewrite probably several
> > times the same block with incoming transactions - it still costs on
> > traffic, and we will process slower even H/W can do better. Don't
> > think it's good, no? ;)
> >
> > Rgds,
> > -Dimitri
> >
> With block sizes you are always trading off overhead versus space
> efficiency. Most OS write only in 4k/8k to the underlying hardware
> regardless of the size of the write you issue. Issuing 16 512byte
> writes has much more overhead than 1 8k write. On the light transaction
> end, there is no real benefit to a small write and it will slow
> performance for high throughput environments. It would be better to,
> and I think that someone is looking into, batching I/O.
>
> Ken
>

Folks,

to close topic with "LOG block size=1K" idea - I took a time to test
it (yes) and in best cases there is only 15% gain comparing to 8K -
storage protocol is quite heavy itself, so less or more data sent
within it doesn't reduce service time too much... As well even this
gain is quickly decreasing with growing workload! So, yes 8K is good
enough and probably the most optimal choice for LOG (as well data)
block size.

Rgds,
-Dimitri

Hey, man, thanks for taking the time to profile that!


erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: Shared buffers, db transactions committed, and write IO on Solaris

From: "Jim C. Nasby"
On Fri, Mar 30, 2007 at 11:19:09AM -0500, Erik Jones wrote:
> >On Fri, Mar 30, 2007 at 04:25:16PM +0200, Dimitri wrote:
> >>The problem is while your goal is to commit as fast as possible -
> >>it's
> >>pity to vast I/O operation speed just keeping common block size...
> >>Let's say if your transaction modification entering into 512K -
> >>you'll
> >>be able to write much more 512K blocks per second rather 8K per
> >>second
> >>(for the same amount of data)... Even we rewrite probably several
> >>times the same block with incoming transactions - it still costs on
> >>traffic, and we will process slower even H/W can do better. Don't
> >>think it's good, no? ;)
> >>
> >>Rgds,
> >>-Dimitri
> >>
> >With block sizes you are always trading off overhead versus space
> >efficiency. Most OS write only in 4k/8k to the underlying hardware
> >regardless of the size of the write you issue. Issuing 16 512byte
> >writes has much more overhead than 1 8k write. On the light
> >transaction
> >end, there is no real benefit to a small write and it will slow
> >performance for high throughput environments. It would be better to,
> >and I think that someone is looking into, batching I/O.
> >
> >Ken
>
> True, and really, considering that data is only written to disk by
> the bgwriter and at checkpoints, writes are already somewhat
> batched.  Also, Dimitri, I feel I should backtrack a little and point
> out that it is possible to have postgres write in 512byte blocks (at
> least for UFS which is what's in my head right now) if you set the
> systems logical block size to 4K and fragment size to 512 bytes and
> then set postgres's BLCKSZ to 512bytes.  However, as Ken has just
> pointed out, what you gain in space efficiency you lose in
> performance so if you're working with a high traffic database this
> wouldn't be a good idea.

Sorry for the late reply, but I was on vacation...

Folks have actually benchmarked filesystem block size on Linux and found
that block sizes larger than 8k can actually be faster. I suppose if you
had a workload that *always* worked with only individual pages it would
be a waste, but it doesn't take much sequential reading to tip the
scales.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)