Thread: PG8 Tuning

From:
"Paul Johnson"
Date:

Hi all, we're running PG8 on a Sun V250 with 8GB RAM and 2*1.3GHz SPARC
CPUs running Solaris 10. The DB cluster is on an external fibre-attached
Sun T3 array that has 9*36GB drives configured as a single RAID5 LUN.

The system is for the sole use of a couple of data warehouse developers,
hence we are keen to use 'aggressive' tuning options to maximise
performance.

So far we have made the following changes and measured the impact on our
test suite:

1) Increase checkpoint_segments from 3 to 64. This made a 10x improvement
in some cases.

2) Increase work_mem from 1,024 to 524,288.

3) Increase shared_buffers from 1,000 to 262,143 (2 GB). This required
setting SHMMAX=4294967295 (4 GB) in /etc/system and re-booting the box.

Question - can Postgres only use 2GB RAM, given that shared_buffers can
only be set as high as 262,143 (8K pages)?
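(A quick sanity check on the arithmetic, illustrative only; the 262,143-page ceiling is what this build appeared to enforce:)

```python
# shared_buffers is counted in 8 KB pages, so 262,143 pages works out
# to just under 2 GiB of shared memory for the buffer pool itself.
PAGE_SIZE = 8 * 1024                        # bytes per buffer page

pages = 262_143                             # apparent maximum on this build
buffer_pool_bytes = pages * PAGE_SIZE

print(buffer_pool_bytes)                    # 2147475456, just under 2**31
print(round(buffer_pool_bytes / 2**30, 6))  # 1.999992 GiB

# SHMMAX must cover the whole shared segment (buffer pool plus
# PostgreSQL's other shared structures), so 4 GB leaves headroom.
SHMMAX = 4294967295
assert buffer_pool_bytes < SHMMAX
```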

So far so good...

4) Move /pg_xlog to an internal disk within the V250. This has had a
severe *negative* impact on performance. Copy job has gone from 2 mins to
12 mins, simple SQL job gone from 1 min to 7 mins. Not even run long SQL
jobs.

I'm guessing that this is because pg_xlog has gone from a 9 spindle LUN to
a single spindle disk?

In cases such as this, where an external storage array with a hardware
RAID controller is used, the normal advice to separate the data from the
pg_xlog seems to come unstuck, or are we missing something?

Cheers,

Paul Johnson.

From:
Richard Huxton
Date:

Paul Johnson wrote:
> Hi all, we're running PG8 on a Sun V250 with 8GB RAM and 2*1.3GHz SPARC
> CPUs running Solaris 10. The DB cluster is on an external fibre-attached
> Sun T3 array that has 9*36GB drives configured as a single RAID5 LUN.
>
> The system is for the sole use of a couple of data warehouse developers,
> hence we are keen to use 'aggressive' tuning options to maximise
> performance.
>
> So far we have made the following changes and measured the impact on our
> test suite:
>
> 1) Increase checkpoint_segments from 3 to 64. This made a 10x improvement
> in some cases.

OK

> 2) Increase work_mem from 1,024 to 524,288.

Don't forget you can use multiples of this in a single query. Might want
to rein it back a bit. I *think* you can set it per-query if you want to
anyway.
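To put a number on the "multiples" point: work_mem is allotted per sort/hash operation rather than per connection, so one complex query can claim several copies of it. Illustrative arithmetic only, using the 524,288 KB figure from the original post:

```python
# work_mem applies per sort/hash node; a query plan with several such
# nodes can use several multiples of the setting at the same time.
work_mem_kb = 524_288                 # setting from the original post
work_mem_bytes = work_mem_kb * 1024   # = 512 MiB per sort/hash node

sort_hash_nodes = 4                   # e.g. a plan with 4 sorts/hashes
peak_bytes = work_mem_bytes * sort_hash_nodes

print(peak_bytes // 2**20)            # 2048 MiB, i.e. 2 GiB for one query
```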

> 3) Increase shared_buffers from 1,000 to 262,143 (2 GB). This required
> setting SHMMAX=4294967295 (4 GB) in /etc/system and re-booting the box.
>
> Question - can Postgres only use 2GB RAM, given that shared_buffers can
> only be set as high as 262,143 (8K pages)?

Well, normally you'd want to keep a fair bit for the O.S. to cache data.
One quarter of your RAM seems very high. Did you try 5000, 10000, or 50000
too, or did you go straight to the top end?

> So far so good...
>
> 4) Move /pg_xlog to an internal disk within the V250. This has had a
> severe *negative* impact on performance. Copy job has gone from 2 mins to
> 12 mins, simple SQL job gone from 1 min to 7 mins. Not even run long SQL
> jobs.
>
> I'm guessing that this is because pg_xlog has gone from a 9 spindle LUN to
> a single spindle disk?

The key limitation will be one commit per rotation of the disk. Multiple
spindles, or better still a battery-backed write cache, will give you
peak transaction rates.
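A rough sense of scale for the one-commit-per-rotation ceiling (assumed figures; the T3's drives were in the 10K RPM class):

```python
# With no write cache, each synchronous commit waits for the WAL block
# to come back under the head: at best one commit per platter rotation.
rpm = 10_000                        # assumed 10K RPM drive
rotations_per_sec = rpm / 60.0

print(round(rotations_per_sec, 1))  # ~166.7 commits/s, absolute best case
```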

> In cases such as this, where an external storage array with a hardware
> RAID controller is used, the normal advice to separate the data from the
> pg_xlog  seems to come unstuck, or are we missing something?

Well, I think the advice then is actually "get 2 external arrays..."

--
   Richard Huxton
   Archonet Ltd

From:
Steve Poe
Date:

Paul,

Before I say anything else, one online document which may be of
assistance to you is:
http://www.powerpostgresql.com/PerfList/

Some thoughts I have:

3) Your shared RAM setting seems like overkill to me. Part of the challenge
is that you're going from 1,000 to 262K with no assessment in between. Each
situation is different, but try the range of 10K - 50K.

4) pg_xlog: If your pg_xlog is on a spindle that is *only* for pg_xlog,
you're better off. If it is sharing with any other OS/DB resource,
performance will suffer.

From what I have learned from others on this list, RAID5 is not the best
choice for the database. RAID10 would be a better solution (using 8 of
your disks); then put your pg_xlog on the remaining disk if possible.

Best of luck,

Steve Poe

From:
Michael Stone
Date:

On Thu, Aug 11, 2005 at 01:23:21PM +0100, Paul Johnson wrote:
>I'm guessing that this is because pg_xlog has gone from a 9 spindle LUN to
>a single spindle disk?
>
>In cases such as this, where an external storage array with a hardware
>RAID controller is used, the normal advice to separate the data from the
>pg_xlog  seems to come unstuck

Yes. That's the downside to dogma. If you're writing pg_xlog to a
battery-backed RAM buffer you'll see faster commits than you will with a
write to a disk, even if you've got a dedicated spindle, unless you've
got constant write activity. (Because once the buffer fills you're
limited to disk speed as you wait for buffer flushes.) If you've got a
lot of system RAM, a battery-backed disk buffer, an OS/filesystem that
effectively delays writes, and bursty transactional writes, it's quite
possible you'll get better performance putting everything on one array
rather than breaking it up to follow the "rules". You might get a
performance boost by putting the transaction log on a separate partition
or LUN on the external array, depending on how the fs implements syncs
or whether you can optimize the filesystem choice for each partition. The
correct approach is to run comparative benchmarks of each configuration.
:-)

Mike Stone

From:
"Jeffrey W. Baker"
Date:

On Fri, 2005-08-12 at 08:47 +0000, Steve Poe wrote:
> Paul,
>
> Before I say anything else, one online document which may be of
> assistance to you is:
> http://www.powerpostgresql.com/PerfList/
>
> Some thoughts I have:
>
> 3) Your shared RAM setting seems like overkill to me. Part of the challenge
> is that you're going from 1,000 to 262K with no assessment in between. Each
> situation is different, but try the range of 10K - 50K.
>
> 4) pg_xlog: If your pg_xlog is on a spindle that is *only* for pg_xlog,
> you're better off.

Like Mr. Stone said earlier, this is pure dogma.  In my experience,
xlogs on the same volume with data is much faster if both are on
battery-backed write-back RAID controller memory.  Moving from this
situation to xlogs on a single normal disk is going to be much slower in
most cases.

-jwb

From:
Tom Arthurs
Date:

I think the T-3 RAID at least breaks some of these rules -- I've got 2
T-3s, one configured as RAID-10 and the other as RAID5, and they both
seem to perform about the same.  I use RAID5 with a hot spare, so it's
using 8 spindles.

I got a lot of performance improvement out of mounting the fs noatime and
turning journaling off.  Of course it takes a *long* time to recover
from a crash.


From:
Mark Lewis
Date:

(Musing, trying to think of a general-purpose performance-tuning rule
that applies here):

Actually, it seems to me that with the addition of the WAL in PostgreSQL
and the subsequent decreased need to fsync the data files themselves
(only during checkpoints?), that the only time a battery-backed write
cache would make a really large performance difference would be on the
drive(s) hosting the WAL.

So although it is in general good to have a dedicated spindle for the
WAL, for many workloads it is in fact significantly better to have the
WAL written to a battery-backed write cache.  The exception would be for
applications with fewer, larger transactions, in which case you could
actually use the dedicated spindle.

Hmmm, on second thought, now I think I understand the rationale behind
having a non-zero commit delay setting-- the problem with putting
pg_xlog on a single disk without a write cache is that frequent fsync()
calls might cause it to spend most of its time seeking instead of
writing (as seems to be happening to Paul here).  Then again, the OS IO
scheduler should take care of this for you, making this a non-issue.
Perhaps Solaris 10 just has really poor IO scheduling performance with
this particular hardware and workload?

Ah well.  Thought myself in circles and have no real conclusions to show
for it.  Posting anyway, maybe this will give somebody some ideas to
work with.

-- Mark Lewis


From:
"Merlin Moncure"
Date:

> Actually, it seems to me that with the addition of the WAL in PostgreSQL
> and the subsequent decreased need to fsync the data files themselves
> (only during checkpoints?), that the only time a battery-backed write
> cache would make a really large performance difference would be on the
> drive(s) hosting the WAL.

It still helps.  In my experience a good BBU RAID controller is only
slightly slower than fsync=false.  Fear the checkpoint storm if you
don't have some write caching.  Beyond that I don't really care about
write delay.

Another thing to watch out for is that some sync modes (varying by
platform) can do more than one seek per sync.  This will absolutely kill
your commit performance on the WAL without write caching.
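Back-of-envelope numbers for the multiple-seeks-per-sync effect (assumed figures: ~8 ms average seek, ~3 ms rotational latency for a 10K RPM drive):

```python
# Every extra seek per fsync adds a full mechanical delay to each
# commit, so doubling the seeks per sync halves commit throughput.
AVG_SEEK_MS = 8.0        # assumed average seek time
ROT_LATENCY_MS = 3.0     # ~half a rotation at 10K RPM

def commits_per_sec(seeks_per_sync: int) -> float:
    ms_per_commit = seeks_per_sync * (AVG_SEEK_MS + ROT_LATENCY_MS)
    return 1000.0 / ms_per_commit

print(round(commits_per_sec(1), 1))  # ~90.9 commits/s
print(round(commits_per_sec(2), 1))  # ~45.5 -- half the throughput
```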

> So although it is in general good to have a dedicated spindle for the
> WAL, for many workloads it is in fact significantly better to have the
> WAL written to a battery-backed write cache.  The exception would be for
> applications with fewer, larger transactions, in which case you could
> actually use the dedicated spindle.

Exactly.


> Hmmm, on second thought, now I think I understand the rationale behind
> having a non-zero commit delay setting-- the problem with putting

I don't trust commit_delay.  Get a good RAID controller and make sure pg
is properly using it.  Now, if you can't (or won't) do some type of
write caching, BBU or not, your system has to be very carefully designed
to get any performance at all, especially with high transaction volumes.


Merlin

From:
Michael Stone
Date:

On Thu, Aug 11, 2005 at 10:18:44AM -0700, Mark Lewis wrote:
>Actually, it seems to me that with the addition of the WAL in PostgreSQL
>and the subsequent decreased need to fsync the data files themselves
>(only during checkpoints?), that the only time a battery-backed write
>cache would make a really large performance difference would be on the
>drive(s) hosting the WAL.

Write cache on a raid array helps in the general case too, because
it allows the controller to aggregate & reorder write requests. The OS
probably tries to do this to some degree, but it can't do as well as the
raid controller because it doesn't know the physical disk layout.

>Hmmm, on second thought, now I think I understand the rationale behind
>having a non-zero commit delay setting-- the problem with putting
>pg_xlog on a single disk without a write cache is that frequent fsync()
>calls might cause it to spend most of its time seeking instead of
>writing (as seems to be happening to Paul here).  Then again, the OS IO
>scheduler should take care of this for you, making this a non-issue.

The OS can't do anything much in terms of IO scheduling for synchronous
writes. Either it handles them in suboptimal order or you get hideous
latency while requests are queued for reordering. Neither option is
really great.

Mike Stone

From:
Jeff Trout
Date:

On Aug 11, 2005, at 12:58 PM, Jeffrey W. Baker wrote:

> Like Mr. Stone said earlier, this is pure dogma.  In my experience,
> xlogs on the same volume with data is much faster if both are on
> battery-backed write-back RAID controller memory.  Moving from this
> situation to xlogs on a single normal disk is going to be much
> slower in
> most cases.
>

This also points out one important thing about performance, which is
a touch unfortunate (and expensive to test): your mileage may vary on
any of these improvements.  Some people have zero problems and
incredible performance with, say, 1000 shared_buffers and the WAL on
the same disk.  Others need 10K shared_buffers and WAL split over a
900-spindle RAID with data spread across 18 SANs...
Unfortunately there is no one true way. :(

The best bet (which is great if you can): try out various settings;
if you still run into problems, look into some more hardware and see if
you can borrow any or fabricate a "poor man's" equivalent for testing.

--
Jeff Trout <>
http://www.jefftrout.com/
http://www.stuarthamm.net/



From:
Josh Berkus
Date:

Jeff,

> > 4) pg_xlog: If your pg_xlog is on a spindle that is *only* for pg_xlog,
> > you're better off.
>
> Like Mr. Stone said earlier, this is pure dogma.  In my experience,
> xlogs on the same volume with data is much faster if both are on
> battery-backed write-back RAID controller memory.  Moving from this
> situation to xlogs on a single normal disk is going to be much slower in
> most cases.

The advice on separate drives for xlog (as is all advice on that web page) is
based on numerous, repeatable tests at OSDL.

However, you are absolutely correct in that it's *relative* advice, not
absolute advice.   If, for example, you're using a $100,000 EMC SAN as your
storage you'll probably be better off giving it everything and letting its
controller and cache handle disk allocation etc.   On the other hand, if
you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter
a case where a separate xlog disk did not benefit an OLTP application.

For Solaris, the advantage of using a separate disk or partition is that the
mount options you want for the xlog (including forcedirectio) are
considerably different from what you'd use with the main database.

--
Josh Berkus
Aglio Database Solutions
San Francisco

From:
Alvaro Herrera
Date:

On Tue, Aug 16, 2005 at 09:12:31AM -0700, Josh Berkus wrote:

> However, you are absolutely correct in that it's *relative* advice, not
> absolute advice.   If, for example, you're using a $100,000 EMC SAN as your
> storage you'll probably be better off giving it everything and letting its
> controller and cache handle disk allocation etc.   On the other hand, if
> you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter
> a case where a separate xlog disk did not benefit an OLTP application.

I've been asked this a couple of times and I don't know the answer: what
happens if you give XLog a single drive (unmirrored single spindle), and
that drive dies?  So the question really is, should you be giving two
disks to XLog?

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."                (Lamar Owen)

From:
"Joshua D. Drake"
Date:

> I've been asked this a couple of times and I don't know the answer: what
> happens if you give XLog a single drive (unmirrored single spindle), and
> that drive dies?  So the question really is, should you be giving two
> disks to XLog?

If that drive dies, you're restoring from backup. You would need to run at
least RAID 1, preferably RAID 10.

Sincerely,

Joshua D. Drake




--
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/

From:
John A Meinel
Date:

Alvaro Herrera wrote:
> On Tue, Aug 16, 2005 at 09:12:31AM -0700, Josh Berkus wrote:
>
>
>>However, you are absolutely correct in that it's *relative* advice, not
>>absolute advice.   If, for example, you're using a $100,000 EMC SAN as your
>>storage you'll probably be better off giving it everything and letting its
>>controller and cache handle disk allocation etc.   On the other hand, if
>>you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter
>>a case where a separate xlog disk did not benefit an OLTP application.
>
>
> I've been asked this a couple of times and I don't know the answer: what
> happens if you give XLog a single drive (unmirrored single spindle), and
> that drive dies?  So the question really is, should you be giving two
> disks to XLog?
>

I can propose a simple test. Create a test database. Run postgres,
insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog
directory. Start postgres again; what does it do?

I suppose to simulate more of a failure mode, you could kill -9 the
postmaster (and all child processes), perhaps during an insert, and
then delete pg_xlog.

But I would like to hear from the postgres folks what they *expect*
would happen if you ever lost pg_xlog.

What about something like keeping pg_xlog on a ramdisk, and then
rsyncing it to a hard disk every 5 minutes? If you die in the middle,
does it just restore back to the point 5 minutes ago, or does it get
more thoroughly messed up?
For some people, a 5-minute-old restore would be okay, as long as you
still have transaction safety, so that you can figure out what needs to
be restored.

John
=:->


From:
Michael Stone
Date:

On Tue, Aug 16, 2005 at 09:12:31AM -0700, Josh Berkus wrote:
>However, you are absolutely correct in that it's *relative* advice, not
>absolute advice.   If, for example, you're using a $100,000 EMC SAN as your
>storage you'll probably be better off giving it everything and letting its
>controller and cache handle disk allocation etc.

Well, you don't have to spend *quite* that much to get a decent storage
array. :)

>On the other hand, if you're dealing with the 5 drives in a single Dell
>6650, I've yet to encounter a case where a separate xlog disk did not
>benefit an OLTP application.

IIRC, that's an older RAID controller that tops out at 128MB write
cache, and 5 spindles ain't a lot--so it makes sense that it would
benefit from a separate spindle for xlog. Also note that I said the
write cache advice goes out the window if you have a workload that
involves constant writing (or if your xlog writes come in faster than
your write cache can drain) because at that point you essentially drop
back to raw disk speed; I assume the OLTP apps you mention are fairly
write-intensive.  OTOH, in a reasonably safe configuration I suppose
you'd end up with a 3-disk RAID5 / 2-disk RAID1, or 2 RAID1 pairs, on
that Dell 6650; is that how you test? Once you're down to that small a
data set I'd expect the system's RAM cache to be a much larger
percentage of the working set, which would tend to make the xlog just
about the *only* latency-critical I/O. That's a different creature from
a data mining app that might really benefit from having additional
spindles to accelerate read performance from indices much larger than
RAM. At any rate, this just underscores the need for testing a
particular workload on particular hardware. Things like the disk speed,
RAID configuration, write cache size, transaction size, data set size,
working set size, concurrent transactions, read vs. write emphasis, etc.,
are each going to have a fairly large impact on performance.

>For Solaris, the advantage of using a separate disk or partition is that the
>mount options you want for the xlog (including forcedirectio) are
>considerably different from what you'd use with the main database.

Yeah, having a separate partition is often good even if you do have
everything on the same disks.

Mike Stone

From:
Tom Lane
Date:

John A Meinel <> writes:
> Alvaro Herrera wrote:
>> I've been asked this a couple of times and I don't know the answer: what
>> happens if you give XLog a single drive (unmirrored single spindle), and
>> that drive dies?  So the question really is, should you be giving two
>> disks to XLog?

> I can propose a simple test. Create a test database. Run postgres,
> insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog
> directory. Start postgres again, what does it do?

That test would really be completely unrelated to the problem.

If you are able to shut down the database cleanly, then you do not need
pg_xlog anymore --- everything is on disk in the data area.  You might
have to use pg_resetxlog to get going again, but you won't lose anything
by doing so.

The question of importance is: if the xlog drive dies while the database
is running, are you going to be able to get the postmaster to shut down
cleanly?  My suspicion is "no" --- if the kernel is reporting write
failures on WAL, that's going to prevent writes to the data drives (good
ol' WAL-before-data rule).  You could imagine failure modes where the
drive is toast but isn't actually reporting any errors ... but one hopes
that's not a common scenario.

In a scenario like this, it might be interesting to have a shutdown mode
that deliberately ignores writing to WAL and just does its best to get
all the dirty pages down onto the data drives.

In the meantime ... use a mirrored drive for WAL.

            regards, tom lane

From:
John A Meinel
Date:

Tom Lane wrote:
> John A Meinel <> writes:
>
>>Alvaro Herrera wrote:
>>
>>>I've been asked this a couple of times and I don't know the answer: what
>>>happens if you give XLog a single drive (unmirrored single spindle), and
>>>that drive dies?  So the question really is, should you be giving two
>>>disks to XLog?
>
>
>>I can propose a simple test. Create a test database. Run postgres,
>>insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog
>>directory. Start postgres again, what does it do?
>
>
> That test would really be completely unrelated to the problem.
>
> If you are able to shut down the database cleanly, then you do not need
> pg_xlog anymore --- everything is on disk in the data area.  You might
> have to use pg_resetxlog to get going again, but you won't lose anything
> by doing so.

So pg_xlog is really only needed for a dirty shutdown. So what about the
idea of having pg_xlog on a ramdisk that is synchronized periodically to
a real disk?

I'm guessing you would get corruption of the database, or at least you
wouldn't know what is clean and what is dirty, since there would be no WAL
entry for some of the things that completed, but also no WAL entry for
things that were not completed.

So what is actually written to the WAL? Is it something like:
"I am writing these pages, and when page X has a certain value, I am
finished"?

I'm just curious, because I don't believe you write to the WAL when you
complete writing the data; you only make a note about what you are
going to do before you do it. So there needs to be a way to detect whether
you actually finished (which would be in the actual data).
John
=:->



From:
Tom Lane
Date:

John A Meinel <> writes:
> So pg_xlog is really only needed for a dirty shutdown. So what about the
> idea of having pg_xlog on a ramdisk that is synchronized periodically to
> a real disk?

Well, if "periodically" means "at every transaction commit", that's
pretty much what we do now.

            regards, tom lane

From:
Josh Berkus
Date:

Michael,

> Well, you don't have to spend *quite* that much to get a decent storage
> array. :)

Yes, I'm just pointing out that it's only the extreme cases which are
clear-cut.  Middle cases are a lot harder to define.  For example, we've
found that on DBT2 running off a 14-drive JBOD, separating off WAL boosts
performance about 8% to 14%.  On DBT3 (DSS), separate WAL (and separate
tablespaces) helps considerably during data load, but not otherwise.  So it
all depends.

> That's a different creature from
> a data mining app that might really benefit from having additional
> spindles to accelerate read performance from indices much larger than
> RAM.

Yes, although the data mining app benefits from the special xlog disk
during ETL.  So it's a tradeoff.

> At any rate, this just underscores the need for testing a
> particular workload on particular hardware

Yes, absolutely.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco