Discussion: performance on new linux box

From:
Ryan Wexler
Date:

PostgreSQL was previously running on a single-CPU Linux machine with 2 GB of memory and a single SATA drive (v8.3).  Basically a desktop with Linux on it.  I experienced slow performance.

So, I finally moved it to a real server: a dual-Xeon CentOS machine with 6 GB of memory and RAID 10, Postgres 8.4.  But I am now experiencing even worse performance issues.

My system is consistently highly transactional.  However, there are also regular complex queries and occasional bulk loads.

On the new system the bulk loads are much slower than on the previous machine, and so are the more complex queries.  The smaller transactional queries seem comparable, but I had expected an improvement.  Performing a db import via psql -d databas -f dbfile illustrates this problem.  It takes 5 hours to run this import.  By contrast, if I perform this same exact import on my crappy Windows box with only 2 GB of memory and default Postgres settings, it takes 1 hour.  Same deal with the old Linux machine.  How is this possible?

Here are some of my key config settings:
max_connections = 100
shared_buffers = 768MB         
effective_cache_size = 2560MB
work_mem = 16MB                
maintenance_work_mem = 128MB   
checkpoint_segments = 7        
checkpoint_timeout = 7min      
checkpoint_completion_target = 0.5

I have tried varying the shared_buffers size from 128 MB all the way to 1500 MB and got basically the same result.  Is there a setting change I should be considering?

Does 8.4 have performance problems or is this unique to me? 

thanks

From:
Tom Lane
Date:

Ryan Wexler <> writes:
> Postgresql was previously running on a single cpu linux machine with 2 gigs
> of memory and a single sata drive (v8.3).  Basically a desktop with linux on
> it.  I experienced slow performance.

> So, I finally moved it to a real server.  A dually zeon centos machine with
> 6 gigs of memory and raid 10, postgres 8.4.  But, I am now experiencing even
> worse performance issues.

I'm wondering if you moved to a kernel+filesystem version that actually
enforces fsync, from one that didn't.  If so, the apparently faster
performance on the old box was being obtained at the cost of (lack of)
crash safety.  That probably goes double for your windows-box comparison
point.

You could try test_fsync from the Postgres sources to confirm that
theory, or do some pgbench benchmarking to have more quantifiable
numbers.

See past discussions about write barriers in this list's archives for
more detail.
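
Lacking test_fsync on the box, plain dd can give a crude read on the same question (a sketch assuming GNU dd on Linux; the /tmp file names are arbitrary):

```shell
# Write 8 MB in 8 kB blocks, forcing a synchronous flush after every block,
# roughly what a commit-per-statement workload demands of the drive.
dd if=/dev/zero of=/tmp/fsync_probe.dat bs=8k count=1024 oflag=dsync 2>&1 | tail -n 1

# Same write without the per-block flush; the OS cache absorbs it.
dd if=/dev/zero of=/tmp/cached_probe.dat bs=8k count=1024 2>&1 | tail -n 1
```

On hardware that honors flushes, the dsync run reports a dramatically lower rate; if the two rates come out nearly identical, something in the stack is probably absorbing the syncs.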

            regards, tom lane

From:
Rob Wultsch
Date:

On Wed, Jul 7, 2010 at 4:06 PM, Ryan Wexler <> wrote:
> Postgresql was previously running on a single cpu linux machine with 2 gigs
> of memory and a single sata drive (v8.3).  Basically a desktop with linux on
> it.  I experienced slow performance.
>
> So, I finally moved it to a real server.  A dually zeon centos machine with
> 6 gigs of memory and raid 10, postgres 8.4.  But, I am now experiencing even
> worse performance issues.
>
> My system is consistently highly transactional.  However, there is also
> regular complex queries and occasional bulk loads.
>
> On the new system the bulk loads are extremely slower than on the previous
> machine and so are the more complex queries.  The smaller transactional
> queries seem comparable but i had expected an improvement.  Performing a db
> import via psql -d databas -f dbfile illustrates this problem.  It takes 5
> hours to run this import.  By contrast, if I perform this same exact import
> on my crappy windows box with only 2 gigs of memory and default postgres
> settings it takes 1 hour.  Same deal with the old linux machine.  How is
> this possible?
>
> Here are some of my key config settings:
> max_connections = 100
> shared_buffers = 768MB
> effective_cache_size = 2560MB
> work_mem = 16MB
> maintenance_work_mem = 128MB
> checkpoint_segments = 7
> checkpoint_timeout = 7min
> checkpoint_completion_target = 0.5
>
> I have tried varying the shared_buffers size from 128 all the way to 1500mbs
> and got basically the same result.   Is there a setting change I should be
> considering?
>
> Does 8.4 have performance problems or is this unique to me?
>
> thanks
>
>

I think the most likely explanation is that the crappy box lied about
fsync'ing data and your new server does not. Did you purchase a RAID card
with a BBU? If so, can you set the write cache policy to write-back?

--
Rob Wultsch


From:
Andy Colson
Date:

On 07/07/2010 06:06 PM, Ryan Wexler wrote:
> Postgresql was previously running on a single cpu linux machine with 2 gigs of memory and a single sata drive (v8.3).  Basically a desktop with linux on it.  I experienced slow performance.
>
> So, I finally moved it to a real server.  A dually zeon centos machine with 6 gigs of memory and raid 10, postgres 8.4.  But, I am now experiencing even worse performance issues.
>
> My system is consistently highly transactional.  However, there is also regular complex queries and occasional bulk loads.
>
> On the new system the bulk loads are extremely slower than on the previous machine and so are the more complex queries.  The smaller transactional queries seem comparable but i had expected an improvement.  Performing a db import via psql -d databas -f dbfile illustrates this problem.  It takes 5 hours to run this import.  By contrast, if I perform this same exact import on my crappy windows box with only 2 gigs of memory and default postgres settings it takes 1 hour.  Same deal with the old linux machine.  How is this possible?
>
> Here are some of my key config settings:
> max_connections = 100
> shared_buffers = 768MB
> effective_cache_size = 2560MB
> work_mem = 16MB
> maintenance_work_mem = 128MB
> checkpoint_segments = 7
> checkpoint_timeout = 7min
> checkpoint_completion_target = 0.5
>
> I have tried varying the shared_buffers size from 128 all the way to 1500mbs and got basically the same result.   Is there a setting change I should be considering?
>
> Does 8.4 have performance problems or is this unique to me?
>
> thanks
>

Yeah, I inherited a "server" (the quotes are sarcastic air quotes), with really bad disk IO... er.. really safe disk IO.  Try the dd test.  On my desktop I get 60-70 meg a second.  On this "server" (I laugh) I got about 20.  I had to go out of my way (way out) to enable the disk caching, and even then only got 50 meg a second.

http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm
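
A scaled-down version of the dd test from that page (a sketch; the real test should use a file around twice the machine's RAM so the OS cache can't hide the disk, and the 100 MB here is only to show the commands):

```shell
# Sequential write, with a final fdatasync so the page cache doesn't fake the number.
dd if=/dev/zero of=/tmp/dd_seq.dat bs=8k count=12800 conv=fdatasync 2>&1 | tail -n 1

# Sequential read of the same file back.
dd if=/tmp/dd_seq.dat of=/dev/null bs=8k 2>&1 | tail -n 1
```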

-Andy


From:
"Pierre C"
Date:

> On the new system the bulk loads are extremely slower than on the
> previous
> machine and so are the more complex queries.  The smaller transactional
> queries seem comparable but i had expected an improvement.  Performing a
> db
> import via psql -d databas -f dbfile illustrates this problem.

If you use psql (not pg_restore) and your file contains no BEGIN/COMMIT
statements, you're probably doing 1 transaction per SQL command. As the
others say, if the old box lied about fsync, and the new one doesn't,
performance will suffer greatly. If this is the case, remember to do your
imports the proper way: either use pg_restore, or group inserts in a
transaction, and build indexes in parallel.
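
For instance, a dump that is just a stream of bare INSERTs can be wrapped in one explicit transaction, or handed to psql's --single-transaction switch (a sketch; the table "items" and database "mydb" are made-up names):

```shell
# Build a small import file with explicit BEGIN/COMMIT around the inserts,
# so the whole load is one transaction instead of one per statement.
{
  echo "BEGIN;"
  for i in 1 2 3; do
    echo "INSERT INTO items (id) VALUES ($i);"
  done
  echo "COMMIT;"
} > /tmp/import.sql

# Equivalent without editing the file (commented out: needs a live server):
# psql -d mydb --single-transaction -f /tmp/import.sql
cat /tmp/import.sql
```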

From:
Eliot Gable
Date:


On Wed, Jul 7, 2010 at 10:07 PM, Andy Colson <> wrote:
On 07/07/2010 06:06 PM, Ryan Wexler wrote:
Postgresql was previously running on a single cpu linux machine with 2 gigs of memory and a single sata drive (v8.3).  Basically a desktop with linux on it.  I experienced slow performance.

So, I finally moved it to a real server.  A dually zeon centos machine with 6 gigs of memory and raid 10, postgres 8.4.  But, I am now experiencing even worse performance issues.

My system is consistently highly transactional.  However, there is also regular complex queries and occasional bulk loads.

On the new system the bulk loads are extremely slower than on the previous machine and so are the more complex queries.  The smaller transactional queries seem comparable but i had expected an improvement.  Performing a db import via psql -d databas -f dbfile illustrates this problem.  It takes 5 hours to run this import.  By contrast, if I perform this same exact import on my crappy windows box with only 2 gigs of memory and default postgres settings it takes 1 hour.  Same deal with the old linux machine.  How is this possible?

Here are some of my key config settings:
max_connections = 100
shared_buffers = 768MB
effective_cache_size = 2560MB
work_mem = 16MB
maintenance_work_mem = 128MB
checkpoint_segments = 7
checkpoint_timeout = 7min
checkpoint_completion_target = 0.5

I have tried varying the shared_buffers size from 128 all the way to 1500mbs and got basically the same result.   Is there a setting change I should be considering?

Does 8.4 have performance problems or is this unique to me?

thanks


Yeah, I inherited a "server" (the quotes are sarcastic air quotes), with really bad disk IO... er.. really safe disk IO.  Try the dd test.  On my desktop I get 60-70 meg a second.  On this "server" (I laugh) I got about 20.  I had to go out of my way (way out) to enable the disk caching, and even then only got 50 meg a second.

http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm


For about $2k - $3k, you can get a server that will do upwards of 300 MB/sec, assuming the bulk of that cost goes to a good hardware-based RAID controller with a battery-backed cache and some good 15k RPM SAS drives. Since it sounds like you are disk I/O bound, it's probably not worth it for you to spend extra on CPU and memory. Sink the money into the disk array instead. If you have an extra $4k in your budget, you might even try 4 of these in a RAID 10:

http://www.provantage.com/ocz-technology-oczssd2-2vtxex100g~7OCZT0L9.htm



--
Eliot Gable

From:
"Kevin Grittner"
Date:

Eliot Gable <> wrote:

> For about $2k - $3k, you can get a server that will do upwards of
> 300 MB/sec, assuming the bulk of that cost goes to a good
> hardware-based RAID controller with a battery backed-up cache and
> some good 15k RPM SAS drives.

FWIW, I concur that the description so far suggests that this server
either doesn't have a good RAID controller card with battery backed-
up (BBU) cache, or that it isn't configured properly.

-Kevin

From:
Eliot Gable
Date:


On Thu, Jul 8, 2010 at 9:53 AM, Kevin Grittner <> wrote:
Eliot Gable <> wrote:

> For about $2k - $3k, you can get a server that will do upwards of
> 300 MB/sec, assuming the bulk of that cost goes to a good
> hardware-based RAID controller with a battery backed-up cache and
> some good 15k RPM SAS drives.

FWIW, I concur that the description so far suggests that this server
either doesn't have a good RAID controller card with battery backed-
up (BBU) cache, or that it isn't configured properly.


On another note, it is also entirely possible that just re-writing your queries will completely solve your problem and make your performance bottleneck go away. Sometimes throwing hardware at a problem is not the best (or cheapest) solution. Personally, I would never throw hardware at a problem until I am certain that I have everything else optimized as much as possible. One of the stored procedures I recently wrote in pl/pgsql was originally chewing up my entire development box's processing capabilities at just 20 transactions per second. It's a pretty wimpy box, so I was not really expecting a lot out of it. However, after spending several weeks optimizing my queries, I now have it doing twice as much work at 120 transactions per second on the same box. So, if I had thrown hardware at the problem, I would have spent 12 times more on hardware than I need to spend now for the same level of performance.

If you can post some of your queries, there are a lot of bright people on this discussion list that can probably help you solve your bottleneck without spending a ton of money on new hardware. Obviously, there is no guarantee -- you might already be as optimized as you can get in your queries, but I doubt it. Even after spending months tweaking my queries, I am still finding things here and there where I can get a bit more performance out of them.

--
Eliot Gable


From:
"Kevin Grittner"
Date:

Eliot Gable <> wrote:

> If you can post some of your queries, there are a lot of bright
> people on this discussion list that can probably help you solve
> your bottleneck

Sure, but the original post was because the brand new server class
machine was performing much worse than the single-drive desktop
machine *on the same queries*, which seems like an issue worthy of
investigation independently of what you suggest.

-Kevin

From:
Ryan Wexler
Date:

Thanks a lot for all the comments.  The fact that both my Windows box and the old Linux box show a massive performance improvement over the new Linux box seems to point to hardware to me.  I am not sure how to test the fsync issue, but I don't see how that could be it.

The raid card the server has in it is:
3Ware 4 Port 9650SE-4LPML RAID Card

Looking it up, it seems to indicate that it has BBU

The only other difference between the boxes is the postgresql version.  The new one has 8.4-2 from the yum install instructions on the site:
http://yum.pgrpms.org/reporpms/repoview/pgdg-centos.html

Any more thoughts?

On Thu, Jul 8, 2010 at 8:02 AM, Kevin Grittner <> wrote:
Eliot Gable <> wrote:

> If you can post some of your queries, there are a lot of bright
> people on this discussion list that can probably help you solve
> your bottleneck

Sure, but the original post was because the brand new server class
machine was performing much worse than the single-drive desktop
machine *on the same queries*, which seems like an issue worthy of
investigation independently of what you suggest.

-Kevin

From:
"Joshua D. Drake"
Date:

On Thu, 2010-07-08 at 09:31 -0700, Ryan Wexler wrote:
> The raid card the server has in it is:
> 3Ware 4 Port 9650SE-4LPML RAID Card
>
> Looking it up, it seems to indicate that it has BBU

No. It supports a BBU. It doesn't necessarily have one.

You need to go into your RAID BIOS. It will tell you.

Sincerely,

Joshua D. Drake


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering

From:
John Rouillard
Date:

On Thu, Jul 08, 2010 at 09:31:32AM -0700, Ryan Wexler wrote:
> Thanks a lot for all the comments.  The fact that both my windows box and
> the old linux box both show a massive performance improvement over the new
> linux box seems to point to hardware to me.  I am not sure how to test the
> fsync issue, but i don't see how that could be it.
>
> The raid card the server has in it is:
> 3Ware 4 Port 9650SE-4LPML RAID Card
>
> Looking it up, it seems to indicate that it has BBU

By "looking it up", I assume you mean running tw_cli and looking at
the output to make sure the bbu is enabled and the cache is turned on
for the raid array u0 or u1 ...?

--
                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

From:
Craig James
Date:

On 7/8/10 9:31 AM, Ryan Wexler wrote:
> Thanks a lot for all the comments.  The fact that both my windows box
> and the old linux box both show a massive performance improvement over
> the new linux box seems to point to hardware to me.  I am not sure how
> to test the fsync issue, but i don't see how that could be it.
>
> The raid card the server has in it is:
> 3Ware 4 Port 9650SE-4LPML RAID Card
>
> Looking it up, it seems to indicate that it has BBU

Make sure the battery isn't dead.  Most RAID controllers drop to non-BBU speeds if they detect that the battery is
faulty.

Craig

From:
Ryan Wexler
Date:



---------- Forwarded message ----------
From: Ryan Wexler <>
Date: Thu, Jul 8, 2010 at 10:12 AM
Subject: Re: [PERFORM] performance on new linux box
To: Craig James <>


On Thu, Jul 8, 2010 at 10:10 AM, Craig James <> wrote:
On 7/8/10 9:31 AM, Ryan Wexler wrote:
Thanks a lot for all the comments.  The fact that both my windows box
and the old linux box both show a massive performance improvement over
the new linux box seems to point to hardware to me.  I am not sure how
to test the fsync issue, but i don't see how that could be it.

The raid card the server has in it is:
3Ware 4 Port 9650SE-4LPML RAID Card

Looking it up, it seems to indicate that it has BBU

Make sure the battery isn't dead.  Most RAID controllers drop to non-BBU speeds if they detect that the battery is faulty.

Craig

--
Sent via pgsql-performance mailing list ()
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Thanks.  The server is hosted, so it is a bit of a hassle to figure this stuff out, but I am having someone check.

From:
Jochen Erwied
Date:

Thursday, July 8, 2010, 7:16:47 PM you wrote:

> Thanks.  The server is hosted, so it is a bit of a hassle to figure this
> stuff out, but I am having someone check.

If you have root access to the machine, you should try 'tw_cli /cx show',
where the x in /cx is the controller number. If not present on the machine,
the command-line-tools are available from 3ware in their download-section.

You should get an output showing something like this:

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       202    01-Jan-1970

Don't ask why the 'LastCapTest' does not show a valid value, the bbu here
completed the test successfully.

--
Jochen Erwied     |   home:      +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work:   +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile:        +49-173-5404164


From:
Ryan Wexler
Date:



On Thu, Jul 8, 2010 at 12:13 PM, Jochen Erwied <> wrote:
Thursday, July 8, 2010, 7:16:47 PM you wrote:

> Thanks.  The server is hosted, so it is a bit of a hassle to figure this
> stuff out, but I am having someone check.

If you have root access to the machine, you should try 'tw_cli /cx show',
where the x in /cx is the controller number. If not present on the machine,
the command-line-tools are available from 3ware in their download-section.

You should get an output showing something like this:

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       202    01-Jan-1970

Don't ask why the 'LastCapTest' does not show a valid value, the bbu here
completed the test successfully.

--
Jochen Erwied     |   home:     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work:  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile:       +49-173-5404164


The tw_cli package doesn't appear to be installed.  I will try to hunt it down. 
However, I just verified with the hosting company that the BBU is off on the RAID controller.  I am trying to find out my options: turn it on, different card, etc...

From:
"Kevin Grittner"
Date:

Ryan Wexler <> wrote:

> I just verified with the hosting company that BBU is off on the
> raid controller.  I am trying to find out my options, turn it on,
> different card, etc...

In the "etc." category, make sure that when you get it turned on,
the cache is configured for "write back" mode, not "write through"
mode.  Ideally (if you can't afford to lose the data), it will be
configured to degrade to "write through" if the battery fails.

-Kevin

From:
Jochen Erwied
Date:

Thursday, July 8, 2010, 9:18:20 PM you wrote:

> However, I just verified with the hosting company that BBU is off on the
> raid controller.  I am trying to find out my options, turn it on, different
> card, etc...

Turning it on requires the external BBU to be installed, so even if a 9650
has BBU support, it requires the hardware on a pluggable card.

And even if the BBU is present, it has to pass the selftest once before
you are able to turn on write caching.


--
Jochen Erwied     |   home:      +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work:   +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile:        +49-173-5404164


From:
Ryan Wexler
Date:


On Thu, Jul 8, 2010 at 12:32 PM, Jochen Erwied <> wrote:
Thursday, July 8, 2010, 9:18:20 PM you wrote:

> However, I just verified with the hosting company that BBU is off on the
> raid controller.  I am trying to find out my options, turn it on, different
> card, etc...

Turning it on requires the external BBU to be installed, so even if a 9650
has BBU support, it requires the hardware on a pluggable card.

And even If the BBU is present, it requires to pass the selftest once until
you are able to turn on write caching.


--
Jochen Erwied     |   home:     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work:  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile:       +49-173-5404164


One thing I don't understand is why BBU will result in a huge performance gain.  I thought BBU was all about power failures?

From:
Ben Chobot
Date:

On Jul 8, 2010, at 12:37 PM, Ryan Wexler wrote:

> One thing I don't understand is why BBU will result in a huge performance gain.  I thought BBU was all about power failures?

When you have a working BBU, the raid card can safely do write caching. Without it, many raid cards are good about turning off write caching on the disks and refusing to do it themselves. (Safety over performance.)

From:
"Kevin Grittner"
Date:

Ryan Wexler <> wrote:

> One thing I don't understand is why BBU will result in a huge
> performance gain.  I thought BBU was all about power failures?

Well, it makes it safe for the controller to consider the write
complete as soon as it hits the RAM cache, rather than waiting for
persistence to the disk itself.  It can then schedule the writes in
a manner which is efficient based on the physical medium.

Something like this was probably happening on your non-server
machines, but without BBU it was not actually safe.  Server class
machines tend to be more conservative about not losing your data,
but without a RAID controller with BBU cache, that slows writes down
to the speed of the rotating disks.

-Kevin

From:
Ryan Wexler
Date:



On Thu, Jul 8, 2010 at 12:46 PM, Kevin Grittner <> wrote:
Ryan Wexler <> wrote:

> One thing I don't understand is why BBU will result in a huge
> performance gain.  I thought BBU was all about power failures?

Well, it makes it safe for the controller to consider the write
complete as soon as it hits the RAM cache, rather than waiting for
persistence to the disk itself.  It can then schedule the writes in
a manner which is efficient based on the physical medium.

Something like this was probably happening on your non-server
machines, but without BBU it was not actually safe.  Server class
machines tend to be more conservative about not losing your data,
but without a RAID controller with BBU cache, that slows writes down
to the speed of the rotating disks.

-Kevin

Thanks for the explanations, that makes things clearer.  It still amazes me that it would account for a 5x change in IO.

From:
David Boreham
Date:

On 7/8/2010 1:47 PM, Ryan Wexler wrote:
> Thanks for the explanations that makes things clearer.  It still
> amazes me that it would account for a 5x change in IO.

The buffering allows decoupling of the write rate from the disk rotation
speed.
Disks don't spin that fast, at least not relative to the speed the CPU
is running at.




From:
"Kevin Grittner"
Date:

Ryan Wexler <> wrote:

> It still amazes me that it would account for a 5x change in IO.

If you were doing one INSERT per database transaction, for instance,
that would not be at all surprising.  If you were doing one COPY in
of a million rows, it would be a bit more surprising.

Each COMMIT of a database transaction, without caching, requires
that you wait for the disk to rotate around to the right position.
Compared to the speed of RAM, that can take quite a long time.  With
write caching, you might write quite a few adjacent disk sectors to
the cache, which can then all be streamed to disk on one rotation.
It can also do tricks like writing a bunch of sectors on one part of
the disk before pulling the heads all the way over to another
portion of the disk to write a bunch of sectors.
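
The arithmetic behind that wait is simple (a back-of-envelope sketch, not a measurement of any particular drive):

```shell
# A 7200 RPM platter passes the same sector 7200/60 = 120 times a second,
# so without a write-back cache each synchronous commit can wait for up to
# one full rotation, capping throughput around that many commits per second.
rpm=7200
rps=$((rpm / 60))
echo "$rps rotations/sec => roughly $rps synchronous commits/sec ceiling" > /tmp/commit_ceiling.txt
cat /tmp/commit_ceiling.txt
```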

It is very good for performance to cache writes.

-Kevin

From:
Craig James
Date:

On 7/8/10 12:47 PM, Ryan Wexler wrote:
>
>
> On Thu, Jul 8, 2010 at 12:46 PM, Kevin Grittner
> < <mailto:>> wrote:
>
>     Ryan Wexler < <mailto:>>
>     wrote:
>
>      > One thing I don't understand is why BBU will result in a huge
>      > performance gain.  I thought BBU was all about power failures?
>
>     Well, it makes it safe for the controller to consider the write
>     complete as soon as it hits the RAM cache, rather than waiting for
>     persistence to the disk itself.  It can then schedule the writes in
>     a manner which is efficient based on the physical medium.
>
>     Something like this was probably happening on your non-server
>     machines, but without BBU it was not actually safe.  Server class
>     machines tend to be more conservative about not losing your data,
>     but without a RAID controller with BBU cache, that slows writes down
>     to the speed of the rotating disks.
>
>     -Kevin
>
> Thanks for the explanations that makes things clearer.  It still amazes
> me that it would account for a 5x change in IO.

It's not exactly a 5x change in I/O, rather it's a 5x change in *transactions*.  Without a BBU Postgres has to wait for each transaction to be physically written to the disk, which at 7200 RPM (or 10K or 15K) means a few hundred per second.  Most of the time Postgres is just sitting there waiting for the disk to say, "OK, I did it."  With BBU, once the RAID card has the data, it's virtually guaranteed it will get to the disk even if the power fails, so the RAID controller says, "OK, I did it" even though the data is still in the controller's cache and not actually on the disk.

It means there's no tight relationship between the disk's rotational speed and your transaction rate.

Craig

From:
Ryan Wexler
Date:

On Thu, Jul 8, 2010 at 12:13 PM, Jochen Erwied <> wrote:
Thursday, July 8, 2010, 7:16:47 PM you wrote:

> Thanks.  The server is hosted, so it is a bit of a hassle to figure this
> stuff out, but I am having someone check.

If you have root access to the machine, you should try 'tw_cli /cx show',
where the x in /cx is the controller number. If not present on the machine,
the command-line-tools are available from 3ware in their download-section.

You should get an output showing something like this:

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       202    01-Jan-1970

Don't ask why the 'LastCapTest' does not show a valid value, the bbu here
completed the test successfully.

--
Jochen Erwied     |   home:     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work:  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile:       +49-173-5404164


Here is what I got:
# ./tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-10   OK             -       -       64K     465.641   OFF    ON

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.81 GB   490350672     WD-WCAT1F502612
p1     OK               u0     233.81 GB   490350672     WD-WCAT1F472718
p2     OK               u0     233.81 GB   490350672     WD-WCAT1F216268
p3     OK               u0     233.81 GB   490350672     WD-WCAT1F216528



From:
Date:

How does the linux machine know that there is a BBU installed and to
change its behavior or change the behavior of Postgres? I am
experiencing performance issues, not with searching but more with IO.

-----Original Message-----
From: 
[mailto:] On Behalf Of Craig James
Sent: Thursday, July 08, 2010 4:02 PM
To: 
Subject: Re: [PERFORM] performance on new linux box

On 7/8/10 12:47 PM, Ryan Wexler wrote:
>
>
> On Thu, Jul 8, 2010 at 12:46 PM, Kevin Grittner
> < <mailto:>>
wrote:
>
>     Ryan Wexler < <mailto:>>
>     wrote:
>
>      > One thing I don't understand is why BBU will result in a huge
>      > performance gain.  I thought BBU was all about power failures?
>
>     Well, it makes it safe for the controller to consider the write
>     complete as soon as it hits the RAM cache, rather than waiting for
>     persistence to the disk itself.  It can then schedule the writes
in
>     a manner which is efficient based on the physical medium.
>
>     Something like this was probably happening on your non-server
>     machines, but without BBU it was not actually safe.  Server class
>     machines tend to be more conservative about not losing your data,
>     but without a RAID controller with BBU cache, that slows writes
down
>     to the speed of the rotating disks.
>
>     -Kevin
>
> Thanks for the explanations that makes things clearer.  It still
amazes
> me that it would account for a 5x change in IO.

It's not exactly a 5x change in I/O, rather it's a 5x change in
*transactions*.  Without a BBU Postgres has to wait for each transaction
to by physically written to the disk, which at 7200 RPM (or 10K or 15K)
means a few hundred per second.  Most of the time Postgres is just
sitting there waiting for the disk to say, "OK, I did it."  With BBU,
once the RAID card has the data, it's virtually guaranteed it will get
to the disk even if the power fails, so the RAID controller says, "OK, I
did it" even though the data is still in the controller's cache and not
actually on the disk.

It means there's no tight relationship between the disk's rotational
speed and your transaction rate.

Craig



From:
Jochen Erwied
Date:

Thursday, July 8, 2010, 11:02:50 PM you wrote:

> Here is what I got:
> # ./tw_cli /c0 show

If that's all you get, then there's no BBU installed, or it is not correctly
connected to the controller.

You could try 'tw_cli /c0/bbu show all' to be sure, but I doubt your output
will change.

--
Jochen Erwied     |   home:      +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work:   +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile:        +49-173-5404164


From:
Craig James
Date:

On 7/8/10 2:18 PM,  wrote:
> How does the linux machine know that there is a BBU installed and to
> change its behavior or change the behavior of Postgres? I am
> experiencing performance issues, not with searching but more with IO.

It doesn't.  It trusts the disk controller.  Linux says, "Flush your cache" and the controller says, "OK, it's flushed."  In the case of a BBU controller, the controller can say that almost instantly because it's got the data in a battery-backed memory that will survive even if the power goes out.  In the case of a non-BBU controller (RAID or non-RAID), the controller has to actually wait for the head to move to the right spot, then wait for the disk to spin around to the right sector, then write the data.  Only then can it say, "OK, it's flushed."

So to Linux, it just appears to be a disk that's exceptionally fast at flushing its buffers.

Craig

>
> -----Original Message-----
> From: 
> [mailto:] On Behalf Of Craig James
> Sent: Thursday, July 08, 2010 4:02 PM
> To: 
> Subject: Re: [PERFORM] performance on new linux box
>
> On 7/8/10 12:47 PM, Ryan Wexler wrote:
>>
>>
>> On Thu, Jul 8, 2010 at 12:46 PM, Kevin Grittner
>> <<mailto:>>
> wrote:
>>
>>      Ryan Wexler<<mailto:>>
>>      wrote:
>>
>>       >  One thing I don't understand is why BBU will result in a huge
>>       >  performance gain.  I thought BBU was all about power failures?
>>
>>      Well, it makes it safe for the controller to consider the write
>>      complete as soon as it hits the RAM cache, rather than waiting for
>>      persistence to the disk itself.  It can then schedule the writes
> in
>>      a manner which is efficient based on the physical medium.
>>
>>      Something like this was probably happening on your non-server
>>      machines, but without BBU it was not actually safe.  Server class
>>      machines tend to be more conservative about not losing your data,
>>      but without a RAID controller with BBU cache, that slows writes
> down
>>      to the speed of the rotating disks.
>>
>>      -Kevin
>>
>> Thanks for the explanations that makes things clearer.  It still
> amazes
>> me that it would account for a 5x change in IO.
>
> It's not exactly a 5x change in I/O, rather it's a 5x change in
> *transactions*.  Without a BBU Postgres has to wait for each transaction
> to be physically written to the disk, which at 7200 RPM (or 10K or 15K)
> means a few hundred per second.  Most of the time Postgres is just
> sitting there waiting for the disk to say, "OK, I did it."  With BBU,
> once the RAID card has the data, it's virtually guaranteed it will get
> to the disk even if the power fails, so the RAID controller says, "OK, I
> did it" even though the data is still in the controller's cache and not
> actually on the disk.
>
> It means there's no tight relationship between the disk's rotational
> speed and your transaction rate.
>
> Craig
>


From:
David Boreham
Date:

On 7/8/2010 3:18 PM,  wrote:
> How does the linux machine know that there is a BBU installed and to
> change its behavior or change the behavior of Postgres? I am
> experiencing performance issues, not with searching but more with IO.
>
It doesn't change its behavior at all. It's in the business of writing
stuff to a file and waiting until that stuff has been put on the disk
(it wants a durable write). What the write buffer/cache does is to
inform the OS, and hence PG, that the write has been done when in fact
it hasn't (yet). So the change in behavior is only to the extent that
the application doesn't spend as much time waiting.
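
The difference is easy to see from the shell. Below is a rough sketch using GNU dd (the file path is arbitrary): the first write returns as soon as the data is in the OS page cache, while `conv=fdatasync` makes dd wait for a durable flush, which is exactly where a write-back (BBU) cache, or a cache that lies, changes the timing.

```shell
#!/bin/sh
# Sketch: buffered vs. durable writes.  With a write-back (BBU) cache
# the second command finishes almost as fast as the first; on bare
# spinning disks it is limited by the physical write.
F=/tmp/dd_flush_test

# Buffered: dd returns once the data is in the OS page cache.
dd if=/dev/zero of="$F" bs=8k count=1000 2>/dev/null

# Durable: dd calls fdatasync() before exiting, so the elapsed time
# includes flushing all the way down to the controller/disk.
dd if=/dev/zero of="$F" bs=8k count=1000 conv=fdatasync 2>/dev/null

rm -f "$F"
```

Timing both commands (e.g. with `time`) on the old and new box would show whether the controller is actually honoring flushes.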



From:
"Joshua D. Drake"
Date:

On Thu, 2010-07-08 at 09:31 -0700, Ryan Wexler wrote:
> The raid card the server has in it is:
> 3Ware 4 Port 9650SE-4LPML RAID Card
>
> Looking it up, it seems to indicate that it has BBU

No. It supports a BBU. It doesn't have one necessarily.

You need to go into your RAID BIOS. It will tell you.

Sincerely,

Joshua D. Drake


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering


From:
Russell Smith
Date:

On 09/07/10 02:31, Ryan Wexler wrote:
Thanks a lot for all the comments.  The fact that both my windows box and the old linux box show a massive performance improvement over the new linux box seems to point to hardware to me.  I am not sure how to test the fsync issue, but I don't see how that could be it.

The raid card the server has in it is:
3Ware 4 Port 9650SE-4LPML RAID Card

Looking it up, it seems to indicate that it has BBU

The only other difference between the boxes is the postgresql version.  The new one has 8.4-2 from the yum install instructions on the site:
http://yum.pgrpms.org/reporpms/repoview/pgdg-centos.html

Any more thoughts?
Really dumb idea, but you don't happen to have the build of the RPMs that had debug enabled, do you?  That build resulted in significant performance problems.

Regards

Russell
From:
Samuel Gendler
Date:

On Fri, Jul 9, 2010 at 2:08 AM, Russell Smith <> wrote:
> On 09/07/10 02:31, Ryan Wexler wrote:
>
>
> The only other difference between the boxes is the postgresql version.  The
> new one has 8.4-2 from the yum install instructions on the site:
> http://yum.pgrpms.org/reporpms/repoview/pgdg-centos.html
>
> Any more thoughts?
>
> Really dumb idea, you don't happen to have the build of the RPM's that had
> debug enabled do you?  That resulted in significant performance problem?
>

The OP mentions that the new system underperforms on a straight dd
test, so it isn't the database config or postgres build.

From:
Ryan Wexler
Date:



On Fri, Jul 9, 2010 at 2:38 AM, Samuel Gendler <> wrote:


Well, I got a new raid card, a MegaRAID 8708EM2 fully equipped with BBU, and read and write caching are enabled.  It completely solved my performance problems.  Now everything is way faster than on the previous server.  Thanks for all the help everyone.

One question I do have is this card has a setting called Read Policy which apparently helps with sequential reads.  Do you think that is something I should enable?



From:
Greg Smith
Date:

Ryan Wexler wrote:
> One question I do have is this card has a setting called Read Policy
> which apparently helps with sequential reads.  Do you think that is
> something I should enable?

Linux will do some amount of read-ahead in a similar way on its own.
You can run "blockdev --getra" and "blockdev --setra" on each disk device on
the system to see the settings and increase them.  I've found tweaking
there, where you can control exactly the amount of readahead, to be
more effective than relying on the less tunable Read Policy modes
in RAID cards that do something similar.  That said, it doesn't seem to
hurt to use both on the LSI card you have; giving more information
to the controller for its use in optimizing how it caches things, by
changing to the more aggressive Read Policy setting, has never
degraded results significantly when I've tried.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us
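
As a concrete sketch of the blockdev tuning described above; the device path is an assumption, substitute the device backing your RAID array, and raising the value requires root:

```shell
#!/bin/sh
# Sketch: inspect the kernel readahead setting for a block device.
# Readahead is reported in 512-byte sectors.  Raising it looks like:
#   blockdev --setra 4096 /dev/sda   # 4096 sectors = 2MB of readahead
show_readahead() {
    dev="$1"
    if [ -b "$dev" ]; then
        echo "readahead for $dev: $(blockdev --getra "$dev") sectors of 512 bytes"
    else
        echo "no block device at $dev"
    fi
}

# /dev/sda is a placeholder; override with DEV=/dev/yourdevice.
show_readahead "${DEV:-/dev/sda}"
```

Benchmarking a sequential read (e.g. a large dd) before and after changing the value shows whether the larger readahead actually helps your workload.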


From:
Andy Colson
Date:

On 07/11/2010 03:02 PM, Ryan Wexler wrote:

>
> Well I got me a new raid card, MegaRAID 8708EM2, fully equipped with
> BBU and read and write caching are enabled.  It completely solved my
> performance problems.  Now everything is way faster than the previous
> server.  Thanks for all the help everyone.
>
> One question I do have is this card has a setting called Read Policy
> which apparently helps with sequentially reads.  Do you think that is
> something I should enable?
>
>
>

I would think it depends on your usage.  If you use clustered indexes (and understand how/when they help) then enabling it would help (since clustering assumes sequential reads).

or if you seq scan a table, it might help (as long as the table is stored relatively close together).

But if you have a big db that doesn't fit into cache, and you bounce all over the place doing seeks, I doubt it'll help.

-Andy

From:
Scott Carey
Date:

But none of this explains why a 4-disk raid 10 is slower than a 1 disk system.  If there is no write-back caching on the RAID, it should still be similar to the one disk setup.

Unless that one-disk setup turned off fsync() or was configured with synchronous_commit off.  Even low-end laptop drives don't lie these days about a cache flush or sync() -- OS's/file systems can, and some SSDs do.

If loss of a transaction during a power failure is OK, then just turn synchronous_commit off and get the performance back.  The discussion about transaction rate being limited by the disks is related to that, and it's not necessary _IF_ it's OK to lose a transaction if the power fails.  For most applications, losing a transaction or two in a power failure is fine.  Obviously, it's not with financial transactions or other such work.
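
For reference, a minimal sketch of that tradeoff; synchronous_commit exists as of PostgreSQL 8.3, so it applies to both versions discussed in this thread. This is illustrative, not a recommendation:

```
# postgresql.conf -- trade the last few commits for commit speed
synchronous_commit = off   # commits return before WAL reaches disk

# Unlike fsync = off, this cannot corrupt the database after a crash;
# it can only lose the most recently reported commits.  It can also
# be enabled per-session instead:  SET synchronous_commit = off;
```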


On Jul 8, 2010, at 2:42 PM, Craig James wrote:



From:
Ben Chobot
Date:

On Jul 14, 2010, at 6:57 PM, Scott Carey wrote:

> But none of this explains why a 4-disk raid 10 is slower than a 1 disk system.  If there is no write-back caching on the RAID, it should still be similar to the one disk setup.

Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the feature on their own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but *not* the cache on the drives.

Take away the controller, and most OS's by default enable the write cache on the drive. You can turn it off if you want, but if you know how to do that, then you're probably also the same kind of person that would have purchased a raid card with a BBU.

From:
Ryan Wexler
Date:

On Wed, Jul 14, 2010 at 6:57 PM, Scott Carey <> wrote:

Something was clearly wrong with my former raid card.  Frankly, I am not sure if it was configuration or simply hardware failure.  The server is hosted so I only had so much access.  But the card was swapped out with a new one and now performance is quite good.  I am just trying to tune the new card now.
thanks for all the input

From:
Scott Carey
Date:

On Jul 14, 2010, at 7:50 PM, Ben Chobot wrote:

> On Jul 14, 2010, at 6:57 PM, Scott Carey wrote:
>
>> But none of this explains why a 4-disk raid 10 is slower than a 1 disk system.  If there is no write-back caching on the RAID, it should still be similar to the one disk setup.
>
> Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the feature on their own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but *not* the cache on the drives.

This does not make sense.
Write caching on all hard drives in the last decade is safe because they support a write cache flush command properly.
If the card is "smart" it would issue the drive's write cache flush command to fulfill an fsync() or barrier request with no BBU.

>
> Take away the controller, and most OS's by default enable the write cache on the drive. You can turn it off if you want, but if you know how to do that, then you're probably also the same kind of person that would have purchased a raid card with a BBU.

Sure, or you can use an OS/file system combination that respects fsync(), which will call the drive's write cache flush.  There are some issues with certain file systems and barriers for file system metadata, but for the WAL log, we're only talking about fdatasync() equivalency, which most file systems do just fine even with a drive's write cache on.


From:
Ben Chobot
Date:

On Jul 15, 2010, at 9:30 AM, Scott Carey wrote:

>> Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the feature on their own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but *not* the cache on the drives.
>
> This does not make sense.
> Write caching on all hard drives in the last decade is safe because they support a write cache flush command properly.  If the card is "smart" it would issue the drive's write cache flush command to fulfill an fsync() or barrier request with no BBU.

You're missing the point. If the power dies suddenly, there's no time to flush any cache anywhere. That's the entire point of the BBU - it keeps the RAM powered up on the raid card. It doesn't keep the disks spinning long enough to flush caches.

From:
Ben Chobot
Date:


On Jul 15, 2010, at 12:40 PM, Ryan Wexler wrote:

On Wed, Jul 14, 2010 at 7:50 PM, Ben Chobot <> wrote:

Ben I don't quite follow your message.   Could you spell it out a little clearer for me?
thanks
-ryan


Most (all?) hard drives have cache built into them. Many raid cards have cache built into them. When the power dies, all the data in any cache is lost, which is why it's dangerous to use it for write caching. For that reason, you can attach a BBU to a raid card which keeps the cache alive until the power is restored (hopefully). But no hard drive I am aware of lets you attach a battery, so using a hard drive's cache for write caching will always be dangerous.

That's why many raid cards will always disable write caching on the hard drives themselves, and only enable write caching using their own memory when a BBU is installed. 

Does that make more sense?

From:
Ryan Wexler
Date:



On Thu, Jul 15, 2010 at 12:35 PM, Ben Chobot <> wrote:

So you are saying write caching is a dangerous proposition on a raid card with or without BBU?
From:
Ben Chobot
Date:

On Jul 15, 2010, at 2:40 PM, Ryan Wexler wrote:


So you are saying write caching is a dangerous proposition on a raid card with or without BBU?


Er, no, sorry, I am not being very clear it seems. 


Using a cache for write caching is dangerous, unless you protect it with a battery. Caches on a raid card can be protected by a BBU, so, when you use a BBU, write caching on the raid card is safe. (Just don't read the firmware changelog for your raid card or you will always be paranoid.) If you don't have a BBU, many raid cards default to disabling caching. You can still enable it, but the card will often tell you it's a bad idea.

There are also caches on all your disk drives. Write caching there is always dangerous, which is why almost all raid cards always disable the hard drive write caching, with or without a BBU. I'm not even sure how many raid cards let you enable the write cache on a drive... hopefully, not many.
From:
"Pierre C"
Date:


Actually write cache is only dangerous if the OS and postgres think some
stuff is written to the disk when in fact it is only in the cache and not
written yet. When power is lost, cache contents are SUPPOSED to be lost.
In a normal situation, postgres and the OS assume nothing is written to
the disk (ie, it may be in cache not on disk) until a proper cache flush
is issued and responded to by the hardware. That's what xlog and journals
are for. If the hardware doesn't lie, and the kernel/FS doesn't have any
bugs, no problem. You can't get decent write performance on rotating media
without a write cache somewhere...


From:
Scott Marlowe
Date:

On Thu, Jul 15, 2010 at 10:30 AM, Scott Carey <> wrote:
> This does not make sense.

Basically, you can have cheap, fast and dangerous (a drive with write
cache enabled, which responds positively to fsync even when it hasn't
actually fsynced the data).  You can have cheap, slow and safe with a
drive that has a cache, but since it'll be fsyncing all the time
the write cache won't actually get used.  Or fast, expensive, and safe,
which is what a BBU RAID card gets by saying the data is fsynced when
it's actually just in cache, but a safe cache that won't get lost on
power down.

I don't find it that complicated.

From:
Scott Carey
Date:

On Jul 15, 2010, at 12:35 PM, Ben Chobot wrote:

> On Jul 15, 2010, at 9:30 AM, Scott Carey wrote:
>
>>> Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the feature
ontheir own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but *not* the cache
onthe drives. 
>>
>> This does not make sense.
>> Write caching on all hard drives in the last decade are safe because they support a write cache flush command
properly. If the card is "smart" it would issue the drive's write cache flush command to fulfill an fsync() or barrier
requestwith no BBU. 
>
> You're missing the point. If the power dies suddenly, there's no time to flush any cache anywhere. That's the entire point of the BBU - it keeps the RAM powered up on the raid card. It doesn't keep the disks spinning long enough to flush caches.

If the power dies suddenly, then the data that is in the OS RAM will also be lost.  What about that?

Well it doesn't matter because the DB is only relying on data being persisted to disk that it thinks has been persisted to disk via fsync().

The data in the disk cache is the same thing as RAM.  As long as fsync() works _properly_, which is true for any file system + disk combination worth a damn (not HFS+ on OSX, not FAT, not a few other things), then it will tell the drive to flush its cache _before_ fsync() returns.  There is NO REASON for a raid card to turn off a drive cache unless it does not trust the drive cache.  In write-through mode, it should not return to the OS with an fsync, direct write, or other "the OS thinks this data is persisted now" call until it has flushed the disk cache.  That does not mean it has to turn off the disk cache.

From:
Scott Carey
Date:

On Jul 15, 2010, at 6:22 PM, Scott Marlowe wrote:

> I don't find it that complicated.

It doesn't make sense that a raid 10 will be slower than a 1-disk setup unless the former respects fsync() and the latter does not.  Individual drive write cache does not explain the situation.  That is what does not make sense.

When in _write-through_ mode, there is no reason to turn off the drive's write cache unless the drive does not properly respect its cache-flush command, or the RAID card is too dumb to issue cache-flush commands.  The RAID card simply has to issue its writes, then issue the flush commands, then return to the OS when those complete.  With drive write caches on, this is perfectly safe.  The only way it is unsafe is if the drive lies and returns from a cache flush before the data from its cache is actually flushed.

Some SSDs on the market currently lie.  A handful of the thousands of hard drive models in the server, desktop, and laptop space in the last decade did not respect the cache flush command properly, and none of them in the SAS/SCSI or 'enterprise SATA' space lie to my knowledge.  Information on this topic has come across this list several times.

The explanation why one setup respects fsync() and another does not almost always lies in the FS + OS combination.  HFS+ on OSX does not respect fsync.  ext3 until recently only did fdatasync() when you told it to fsync() (which is fine for postgres' transaction log anyway).

A raid card, especially with any SAS/SCSI drives, has no reason to turn off the drive's write cache unless it _wants_ to return to the OS before the data is on the drive.  That condition occurs in write-back cache mode when the RAID card's cache is safe via a battery or some other mechanism.  In that case, it should turn off the drive's write cache so that it can be sure that data is on disk when a power fails without having to call the cache-flush command on every write.  That way, it can remove data from its RAM as soon as the drive returns from the write.
In write-through mode it should turn the caches back on and rely on the flush command to pass through direct writes, cache flush demands, and barrier requests.  It could optionally turn the caches off, but that won't improve data safety unless the drive cannot faithfully flush its cache.



From:
Ben Chobot
Date:

On Jul 15, 2010, at 8:16 PM, Scott Carey wrote:

> On Jul 15, 2010, at 12:35 PM, Ben Chobot wrote:
>
>> On Jul 15, 2010, at 9:30 AM, Scott Carey wrote:
>>
>>>> Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the
>>>> feature on their own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but
>>>> *not* the cache on the drives.
>>>
>>> This does not make sense.
>>> Write caching on all hard drives in the last decade is safe because they support a write cache flush command
>>> properly. If the card is "smart" it would issue the drive's write cache flush command to fulfill an fsync() or
>>> barrier request with no BBU.
>>
>> You're missing the point. If the power dies suddenly, there's no time to flush any cache anywhere. That's the
>> entire point of the BBU - it keeps the RAM powered up on the raid card. It doesn't keep the disks spinning long
>> enough to flush caches.
>
> If the power dies suddenly, then the data that is in the OS RAM will also be lost.  What about that?
>
> Well it doesn't matter because the DB is only relying on data being persisted to disk that it thinks has been
> persisted to disk via fsync().

Right, we agree that only what has been fsync()'d has a chance to be safe....

> The data in the disk cache is the same thing as RAM.  As long as fsync() works _properly_, which is true for any
> file system + disk combination worth a damn (not HFS+ on OSX, not FAT, not a few other things), then it will tell
> the drive to flush its cache _before_ fsync() returns.  There is NO REASON for a raid card to turn off a drive cache
> unless it does not trust the drive cache.  In write-through mode, it should not return to the OS with a fsync,
> direct write, or other "the OS thinks this data is persisted now" call until it has flushed the disk cache.  That
> does not mean it has to turn off the disk cache.

...and here you are also right in that a write-through write cache is safe, with or without a battery. A write-through
cache is a win for things that don't often fsync, but my understanding is that with a database, you end up fsyncing
all the time, which makes a write-through cache not worth very much. The only good way to get good *database*
performance out of spinning media is with a write-back cache, and the only way to make that safe is to hook up a BBU.


From:
Craig Ringer
Date:

On 16/07/10 06:18, Ben Chobot wrote:

> There are also caches on all your disk drives. Write caching there is always dangerous, which is why almost all raid
> cards always disable the hard drive write caching, with or without a BBU. I'm not even sure how many raid cards let
> you enable the write cache on a drive... hopefully, not many.

AFAIK Disk drive caches can be safe to leave in write-back mode (ie
write cache enabled) *IF* the OS uses write barriers (properly) and the
drive understands them.

Big if.

--
Craig Ringer

From:
Craig Ringer
Date:

On 16/07/10 09:22, Scott Marlowe wrote:
> On Thu, Jul 15, 2010 at 10:30 AM, Scott Carey <> wrote:
>>
>> On Jul 14, 2010, at 7:50 PM, Ben Chobot wrote:
>>
>>> On Jul 14, 2010, at 6:57 PM, Scott Carey wrote:
>>>
>>>> But none of this explains why a 4-disk raid 10 is slower than a 1 disk system.  If there is no write-back caching
>>>> on the RAID, it should still be similar to the one disk setup.
>>>
>>> Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the
>>> feature on their own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but
>>> *not* the cache on the drives.
>>
>> This does not make sense.
>
> Basically, you can have cheap, fast and dangerous (a drive with write
> cache enabled, which responds positively to fsync even when it hasn't
> actually fsynced the data).  You can have cheap, slow and safe with a
> drive that has a cache but, since it'll be fsyncing all the time,
> the write cache won't actually get used.  Or fast, expensive, and safe,
> which is what a BBU RAID card gets by saying the data is fsynced when
> it's actually just in cache, but a safe cache that won't get lost on
> power down.

Speaking of BBUs... do you ever find yourself wishing you could use
software RAID with battery backup?

I tend to use software RAID quite heavily on non-database servers, as
it's cheap, fast, portable from machine to machine, and (in the case of
Linux 'md' raid) reliable. Alas, I can't really use it for DB servers
due to the need for write-back caching.

There's no technical reason I know of why sw raid couldn't write-cache
to some non-volatile memory on the host. A dedicated, battery-backed
pair of DIMMs on a PCI-E card mapped into memory would be ideal. Failing
that, a PCI-E card with onboard RAM+BATT or fast flash that presents an
AHCI interface so it can be used as a virtual HDD would do pretty well.
Even one of those SATA "RAM Drive" units would do the job, though
forcing everything though the SATA2 bus would be a performance downside.

The only issue I see with sw raid write caching is that it probably
couldn't be done safely on the root file system. The OS would have to
come up, init software raid, and find the caches before it'd be safe to
read or write volumes with s/w raid write caching enabled. It's not the
sort of thing that'd be practical to implement in GRUB's raid support.

--
Craig Ringer

From:
Greg Smith
Date:

Scott Carey wrote:
> As long as fsync() works _properly_, which is true for any file system + disk combination worth a damn (not HFS+ on
> OSX, not FAT, not a few other things), then it will tell the drive to flush its cache _before_ fsync() returns.
> There is NO REASON for a raid card to turn off a drive cache unless it does not trust the drive cache.  In
> write-through mode, it should not return to the OS with a fsync, direct write, or other "the OS thinks this data is
> persisted now" call until it has flushed the disk cache.  That does not mean it has to turn off the disk cache.
>

Assuming that the operating system will pass through fsync calls to
flush data all the way to drive level in all situations is an extremely
dangerous assumption.  Most RAID controllers don't know how to force
things out of the individual drive caches; that's why they turn off
write caching on them.  Few filesystems get the details right to handle
individual drive cache flushing correctly.  On Linux, XFS and ext4 are
the only two with any expectation that will happen, and of those two
ext4 is still pretty new and therefore should still be presumed to be buggy.

Please don't advise people about what is safe based on theoretical
grounds here, in practice there are way too many bugs in the
implementation of things like drive barriers to trust them most of the
time.  There is no substitute for a pull the plug test using something
that looks for bad cache flushes, i.e. diskchecker.pl:
http://brad.livejournal.com/2116715.html  If you do that you'll discover
you must turn off the individual drive caches when using a
battery-backed RAID controller, and you can't ever trust barriers on
ext3 because of bugs that were only fixed in ext4.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us