Thread: Excessive (and slow) fsync() within single transaction

Excessive (and slow) fsync() within single transaction

From:
Stephen Tyler
Date:
I've been trying to track down a performance issue I have.  In simple terms, my select performance is very good (generally either CPU limited, or disk limited, depending upon the query), and small updates seem OK.

But a huge UPDATE is incredibly slow.  CPU is on average below 1%, and disk IO is generally below 0.1% of what I would expect.  It seems to be doing tiny bursts of activity, interspersed with long periods of inactivity.  I've observed the same behaviour in many other cases where the amount of writing is large.

My system:

Mac Pro 2009 Quad 2.93 with 16G of ECC RAM
Snow Leopard 10.6.2 in 64bit mode, fully patched
Database on RAID 0 array of Intel X-25M SSDs
Postgres 8.4.1, 64 bit, compiled from source

From outside of postgres, I can sustain over 400MB/s read and 140MB/s write indefinitely, and around 10K random read I/Os/sec and 4K random write I/Os/sec, with no hint of pauses.  I have never witnessed anything other than speedy performance from the SSDs except from within postgresql.

An example of a problematic SQL command is:
update link_relurl as u set sid = m.sid from link_meta as m where u.link = m.link;

Both tables are around 20Gbytes, and 200M rows, with link as the primary integer key.  So in this example, I am joining 2 tables of 200M rows and writing a new value to a column of one of those tables.

The command is issued via psql.  There are no other connections to the database.

My expectation would be that postgresql would issue an fsync() (or perhaps a few such calls) at the end of the transaction.  But this does not seem to be the case:

Using the DTrace utility dtruss to time all calls to fsync being made by postgres:
$> sudo dtruss -n postgres -t fsync -e
  107/0x2dd:        7 fsync(0x3, 0xD, 0x100805400)         = 0 0
  107/0x2dd:        5 fsync(0x5, 0x8, 0x100805400)         = 0 0
  107/0x2dd:        4 fsync(0x21, 0x1A, 0x100805400)         = 0 0
  107/0x2dd:        5 fsync(0x1A, 0x8, 0x100805400)         = 0 0
  107/0x2dd:        4 fsync(0x19, 0x12, 0x100805400)         = 0 0
  107/0x2dd:        4 fsync(0x4, 0x2, 0x100805400)         = 0 0
  107/0x2dd:        5 fsync(0x7, 0x12, 0x100805400)         = 0 0
  107/0x2dd:        4 fsync(0x1C, 0xB, 0x100805400)         = 0 0
  107/0x2dd:        4 fsync(0x1E, 0xB, 0x100805400)         = 0 0
  107/0x2dd:      174 fsync(0x24, 0x103102AD8, 0x0)         = 0 0
  107/0x2dd:     2018 fsync(0x20, 0x1B, 0x100805400)         = 0 0
  107/0x2dd:        7 fsync(0x12, 0xC, 0x100805400)         = 0 0
  107/0x2dd:      987 fsync(0x23, 0x1B, 0x100805400)         = 0 0
.....

With thousands of such lines.  The columns are PID, elapsed time in microseconds, and the system call with arguments.  In the lines above, the longest fsync is only 2 milliseconds.  Most of the lines are similar to the above.

So firstly, why are there so many calls to fsync()?

Unfortunately, some of the calls to fsync are incredibly slow:
  107/0x2dd:  22526373 fsync(0x19, 0x1, 0x100805400)         = 0 0
  107/0x2dd:  22637723 fsync(0x10, 0x9, 0x100805400)         = 0 0
  107/0x2dd:  61745976 fsync(0xA, 0xF, 0x100805400)         = 0 0

So in this triplet of consecutive calls we have elapsed times of 22, 22 and 61 seconds!

I'm no expert on DTrace, nor the inner workings of postgresql, but if anybody has any ideas on what might be causing this behaviour, how to track it down, or how to fix or avoid it, I would be very grateful.

Stephen

Re: Excessive (and slow) fsync() within single transaction

From:
Tom Lane
Date:
Stephen Tyler <stephen@stephen-tyler.com> writes:
> My expectation would be that postgresql would issue an fsync() (or perhaps a
> few such calls) at the end of the transaction.  But this does not seem to be
> the case:

fsyncs mostly happen as a result of checkpoint activity.  I wonder
whether you have done any tuning of checkpoint intervals,
shared_buffers, etc.
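
A quick way to see what those are currently set to (just a sketch using the standard pg_settings view; adjust the name list as needed):

select name, setting, unit
  from pg_settings
 where name in ('shared_buffers', 'checkpoint_segments', 'checkpoint_timeout',
                'checkpoint_completion_target', 'bgwriter_delay', 'wal_sync_method');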

The very long times to complete the fsyncs sound like a kernel or driver
issue; not much we can do to help you with that.

            regards, tom lane

Re: Excessive (and slow) fsync() within single transaction

From:
Greg Smith
Date:
Stephen Tyler wrote:
> So firstly, why are there so many calls to fsync()?
Every time a transaction commits, you get an fsync to the WAL file.
Then, during the periodic database checkpoints, you get more fsync'd
writes.  The latter are more likely to be your problem.

You should turn on log_checkpoints in the postgresql.conf and confirm the
slowdowns are happening around the same time as the checkpoint report
gets written to the log files.  Presuming that turns out to be the case,
you'll want to follow the suggestions at
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server to improve
the behavior, and you can get a lot more background about what's
happening at
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

We just went through one of these last week with someone else, you might
find the troubleshooting process and surrounding discussion interesting
reading:

http://archives.postgresql.org/pgsql-performance/2009-12/msg00033.php

One useful later stage in troubleshooting problems in this area is to
reduce the amount of caching done by the operating system, so that there
aren't as many blocks backed up when the fsync calls happen.  I'm not
sure how to do this on OS X though.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Excessive (and slow) fsync() within single transaction

From:
Merlin Moncure
Date:
On Tue, Dec 8, 2009 at 7:28 PM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
> So firstly, why are there so many calls to fsync()?
>
> Unfortunately, some of the calls to fsync are incredibly slow:
>   107/0x2dd:  22526373 fsync(0x19, 0x1, 0x100805400)         = 0 0
>   107/0x2dd:  22637723 fsync(0x10, 0x9, 0x100805400)         = 0 0
>   107/0x2dd:  61745976 fsync(0xA, 0xF, 0x100805400)         = 0 0
>
> So in this triplet of consecutive calls we have elapsed times of 22, 22 and
> 61 seconds!
>
> I'm no expert on DTrace, nor the inner workings of postgresql, but if
> anybody has any ideas on what might be causing this behaviour, how to track
> it down, or how to fix or avoid it, I would be very grateful.

Curious, have you checked the log for frequent checkpoint warnings?
Are you logging checkpoints and seeing if they line up with the slow
fsync times?  My guess here is that you are overflowing the write-back
cache on the X25-M and the long fsync times are waiting for the disk
to flush the cache, or some background housecleaning is going on.
This behavior is very similar to what I've observed on certain raid
controllers when the cache overflows.

Possible resolutions might include:
*) you might luck out and be dealing with a known (and hopefully
corrected) issue with the Intel drives
*) tweak checkpoints/bgwriter to try and reduce the penalty of fsync
*) disable fsync...risky, but I bet it will make the problem go away
*) optimize your query. maybe try rewriting it as a delete/insert
select from a scratch table? (see the sketch below)
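
Something along these lines, perhaps (only a sketch: the column list is a guess, and any indexes, constraints, and triggers on link_relurl would have to be recreated on the new table):

begin;
-- build the corrected table in one sequential pass instead of updating in place
create table link_relurl_new as
    select u.link, m.sid, u.url    -- hypothetical column list
      from link_relurl u
      join link_meta m using (link);
-- swap it in (dependent foreign keys would need to be dropped and recreated)
drop table link_relurl;
alter table link_relurl_new rename to link_relurl;
commit;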

merlin

Re: Excessive (and slow) fsync() within single transaction

From:
Stephen Tyler
Date:
Some more information on fsync using the test_fsync tool in postgres/src/tools:

System is Snow Leopard 10.6.2 (64 bit).  Volumes are HFS+ formatted.

PASS 1 - DATABASE VOLUME (SSD RAID 0)
===================================
$ sudo ./test_fsync -f /Volumes/SSD/fsync_test.out
Simple write timing:
    write                    0.127178

Compare fsync times on write() and non-write() descriptor:
If the times are similar, fsync() can sync data written
on a different descriptor.
    write, fsync, close      0.254609
    write, close, fsync      0.316973

Compare one o_sync write to two:
    one 16k o_sync write     0.260829
    two 8k o_sync writes     0.259231

Compare file sync methods with one 8k write:
    open o_dsync, write      0.135035
    open o_sync, write       0.128056
    write, fdatasync         0.154622
    write, fsync             0.154496

Compare file sync methods with two 8k writes:
    open o_dsync, write      0.260390
    open o_sync, write       0.426376
    write, fdatasync         0.382204
    write, fsync             0.249774

PASS 2 - SATA HDD (7200 rpm)
=========================
$ sudo ./test_fsync -f /Volumes/Work/fsync_test.out
Simple write timing:
    write                    0.205038

Compare fsync times on write() and non-write() descriptor:
If the times are similar, fsync() can sync data written
on a different descriptor.
    write, fsync, close      0.498885
    write, close, fsync      0.383176

Compare one o_sync write to two:
    one 16k o_sync write     0.237621
    two 8k o_sync writes     0.458676

Compare file sync methods with one 8k write:
    open o_dsync, write      0.171746
    open o_sync, write       0.195227
    write, fdatasync         0.195102
    write, fsync             0.218675

Compare file sync methods with two 8k writes:
    open o_dsync, write      0.345076
    open o_sync, write       0.368541
    write, fdatasync         0.254322
    write, fsync             0.271039

If the tests are repeated, the numbers are not very stable, but generally in the range 0.18 to 0.48.

The times are apparently in milliseconds.  But that doesn't make any sense for the 7200rpm SATA HDD, given it has an 8.3ms rotation time.  And the numbers for the SSD are basically identical to the HDD, apart from the bare write timing.

Do I conclude that the sync timings are all faked by the OS?

Stephen

Re: Excessive (and slow) fsync() within single transaction

From:
Stephen Tyler
Date:
On Wed, Dec 9, 2009 at 12:57 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> You should turn on log_checkpoints in the postgresql.conf and confirm the slowdowns are happening around the same time as the checkpoint report gets written to the log files.
 
Thank you for your suggestions.  I will turn on log_checkpoints next time I get the chance to restart.

Some more details about my current config (almost no activity since boot except the one large update still in progress):

select * from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
               440 |              51 |           39248820 |       8876042 |            85527 |         3952243 |      48756188
and a couple of hours later:
               465 |              56 |           43316686 |       9855720 |            95058 |         3997393 |      53713342

My analysis:

checkpoints_timed increments roughly every 5 minutes, and checkpoints_req increments roughly every 30 minutes.

Around 80% of the checkpoints are timed, rather than requested.  But the pauses/freezes are of the order of 30 to 60 seconds apart (roughly aligned with similar slow fsync calls), not 5 minutes apart.  There are multiple freezes per checkpoint.

During each checkpoint, 83K * 8K blocks were written out, or 650MB on average per checkpoint.  Around 20% were cleaned by the background cleaner.  If this occurred at anywhere near my drive's average random write speed, it should take less than 20 seconds even if the disc blocks were 100% random.

I don't understand how maxwritten_clean could be as high as 95058, and increment at more than 1 per second.  This is a process count, not a buffer count?  How often is the background cleaner launched?  Does that mean I need to massively increase bgwriter_lru_maxpages, and other bgwriter params?  They are currently default values.
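
One way I could track those deltas more precisely (rough sketch; bgw_snap is just a scratch table name I made up):

create table bgw_snap as
    select now() as taken_at, * from pg_stat_bgwriter;
-- repeat periodically:
insert into bgw_snap select now(), * from pg_stat_bgwriter;
-- per-interval deltas:
select taken_at,
       checkpoints_timed + checkpoints_req
           - lag(checkpoints_timed + checkpoints_req) over w as checkpoints,
       buffers_checkpoint - lag(buffers_checkpoint) over w as ckpt_buffers,
       maxwritten_clean   - lag(maxwritten_clean)   over w as maxwritten
  from bgw_snap
window w as (order by taken_at)
 order by taken_at;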


On Wed, Dec 9, 2009 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> fsyncs mostly happen as a result of checkpoint activity.  I wonder
> whether you have done any tuning of checkpoint intervals,
> shared_buffers, etc.

I wouldn't describe the values as "tuned" ... more like set and forget ...
postgresql.conf (currently):

max_connections = 50
shared_buffers = 4GB
work_mem = 64MB
maintenance_work_mem = 1GB
synchronous_commit = 'off'
wal_buffers = 8MB
checkpoint_segments = 128
autovacuum = 'on'
effective_cache_size = 12GB
default_statistics_target = 100
constraint_exclusion = 'on'


shared_buffers is 1/4 of RAM, and effective_cache_size is 3/4 RAM.  The system is not swapping, and currently 9.5GB of RAM is marked as "Inactive".

checkpoint_segments is large, but in the suggested range of 32 to 256 for large, write-heavy, batch-oriented systems.  I've used smaller values in the past, but I got lots of checkpoint warnings in the console.

wal_buffers might be a bit high.  I don't fully understand the implications, and how to tune this value, even after reading the tuning guides.

On Wed, Dec 9, 2009 at 4:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Possible resolutions might include:
> *) you might luck out and be dealing with a known (and hopefully
> corrected) issue with the Intel drives

Latest firmware for the 50nm version.  I haven't found any known issue.
During the pauses, "Activity Monitor" reports no disk activity.  But I am not sure if this implies no SATA traffic.
 
> *) tweak checkpoints/bgwriter to try and reduce the penalty of fsync

Make them more frequent but smaller, or less frequent but bigger?  Which way has less penalty?
 
> *) disable fsync...risky, but I bet it will make the problem go away

I'm very tempted to live dangerously.  I would like to actually turn off fsync's during a transaction, and just use the fsync at the very end of the transaction to confirm that the transaction will be permanent.  Or maybe use the filesystem journaling function to ensure I can roll-back to a valid state if something bad happens mid-transaction.  But these do not seem to be options.
 
> *) optimize your query. maybe try rewriting it as a delete/insert
> select from a scratch table?

Unfortunately, I have a lot of foreign keys with "ON DELETE CASCADE", and various other triggers.  While that greatly simplified my housekeeping code, I can no longer do delete & reinsert.

I can break the big transactions into hundreds or thousands of little transactions.  This does let me stop/restart the system at will, but I've yet to see it help my throughput.
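
For the record, the batched version looks something like this (sketch only; the key ranges are arbitrary, one transaction per range):

update link_relurl as u
   set sid = m.sid
  from link_meta as m
 where u.link = m.link
   and u.link between 1 and 1000000;
-- then 1000001 to 2000000, and so on.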



Maybe I've overlooked something, but nothing looks too wrong with postgresql.conf to me, based on my reading of the tuning guides.  One checkpoint write every 5 minutes (80% timed) doesn't seem too bad.  Something else (probably the very slow fsync) is keeping the checkpoint writing rate low, and there appear to be a lot more fsync's than checkpoint segments.

Stephen

Re: Excessive (and slow) fsync() within single transaction

From:
Scott Marlowe
Date:
On Wed, Dec 9, 2009 at 5:11 AM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
> The times are apparently in milliseconds.  But that doesn't make any sense
> for the 7200rpm SATA HDD, given it has an 8.3ms rotation time.  And the
> numbers for the SSD are basically identical to the HDD, apart from the bare
> write timing.
>
> Do I conclude that the sync timings are all faked by the OS?

Actually, it's usually the drives that lie about fsync, especially
consumer grade (and some server grade) SATA / PATA drives are known
for this.

Re: Excessive (and slow) fsync() within single transaction

From:
Ron Mayer
Date:
Scott Marlowe wrote:
> Actually, it's usually the drives that lie about fsync, especially
> consumer grade (and some server grade) SATA / PATA drives are known
> for this.

I'm still looking for any evidence of any drive that lies.

Is there actually a drive which claims to support the
IDE "FLUSH_CACHE_EXT" feature (IDE command 0xEA) that doesn't?!?

You can see if your drive claims to - most drives this century do -
in Linux with "hdparm -I /dev/sdX", which should show FLUSH_CACHE_EXT
if the disk supports this extension.  And if your disk is using a
last-century IDE spec that predated that feature, I think your kernel
probably told you in the startup messages that your drive is
not so much lying, but obsolete and missing a feature.

Software on the other hand sucks at sending these commands.

Ext3 will only send the right IDE commands if the inode was
touched[1]. VirtualBox doesn't send them correctly[2].  I
hear LVM and MD raid can lose those events too.
Since so many layers don't send the right IDE commands to flush,
userspace utilities exist to send those commands[3].  I
sometimes wonder if postgres should add something like that.

But I still haven't found any drive that lies -- and only a couple
whose firmware openly admits not to support the feature.


[1] http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272253.html
[2] http://openquery.com/blog/host-flush
[3] http://forum.soft32.com/linux/flush-disk-write-cache-userspace-ftopict338477.html


Re: Excessive (and slow) fsync() within single transaction

From:
Greg Smith
Date:
Stephen Tyler wrote:
> The times are apparently in milliseconds.  But that doesn't make any
> sense for the 7200rpm SATA HDD, given it has an 8.3ms rotation time.
> And the numbers for the SSD are basically identical to the HDD, apart
> from the bare write timing.
> Do I conclude that the sync timings are all faked by the OS?
PostgreSQL on OS X uses a special fsync method by default named
'fsync_writethrough'.  If you run this on your database you should be
able to confirm it's using it:

show wal_sync_method;

This method includes a call to Apple's API to make sure the cache is
properly flushed on the drives.  Lots more details about how all this
works at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm

I'm not sure if the test_fsync tool actually tests this particular
wal_sync_method though, so on OS X you may only be getting reports that
don't include a proper drive cache flush.  The server itself does the
right thing.  Making test_fsync do exactly the same thing as the real
WAL write in all cases has been a lingering TODO item for a while now; I
submitted a patch to improve the whole thing a while ago, but it's not
really robust in all situations yet.  Unfortunately I think you're
getting bad data from the test tool, and I'm not sure of a source for a
similar one that includes the right drive flushing behavior for OS X.
(As a general note, there's a similar call on Windows, and I believe it
also suffers from this problem when running test_fsync; that little
test utility has really only been tested well on UNIX-ish systems).

What you can do in order to test the fsync rate as a side-effect of other
work the server does is to run pgbench with a custom script.  If you grab a
copy of

http://git.postgresql.org/gitweb?p=pgbench-tools.git;a=blob_plain;f=tests/insert.sql;hb=945164aaadf50ea645750fe77c4ffcecafb5c353

and save that as "insert.sql", then do:

createdb pgbench
pgbench -i -s 10 pgbench
pgbench -s 10 -t 1000 -c 1 -f insert.sql

This will tell you how many insert statements/second are committing,
which is basically just a tiny write to the WAL followed by an fsync.
That's basically the fsync rate of the system, as long as you don't get
a checkpoint in the middle slowing it down.
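
If you can't grab that file for some reason, even a trivial one-INSERT script gives a similar commit-rate test (this is just a stand-in I'm sketching here, not necessarily the same as the linked insert.sql):

-- save as insert.sql; pgbench runs it as one transaction per execution,
-- so each execution is roughly one small WAL write plus one fsync
insert into pgbench_history (tid, bid, aid, delta, mtime)
    values (1, 1, 1, 0, now());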

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Excessive (and slow) fsync() within single transaction

From:
Greg Smith
Date:
Ron Mayer wrote:
> Scott Marlowe wrote:
>> Actually, it's usually the drives that lie about fsync, especially
>> consumer grade (and some server grade) SATA / PATA drives are known
>> for this.
> I'm still looking for any evidence of any drive that lies.
> Is there actually a drive which claims to support the
> IDE "FLUSH_CACHE_EXT" feature (IDE command 0xEA) that doesn't?!?
You seem to think that a "lying drive" is one that doesn't flush its cache when you ask it to.  That's not the definition, and as you point out it's rare behavior--some crappy drives would do this many years ago when the drive flush was a new thing, nowadays that's unheard of.

But back in the day, when you wrote to a drive, it never returned success until that write was done.  Anything that doesn't have that behavior was considered a "lying drive".  It was something that consumer drives did but "enterprise" ones didn't.

Nowadays, most drives do this, and the various cache flush calls allow one to work around it.  But the drives still "lie" from some viewpoints, simply because the write calls don't wait for completion before returning success, the original criteria.  If you're naive and believe that anything the drive has said is written is actually on disk, which used to be how things worked, from that perspective all (well, almost all) drives lie.  Having that behavior documented and a workaround available doesn't change that.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: Excessive (and slow) fsync() within single transaction

From:
Scott Marlowe
Date:
On Wed, Dec 9, 2009 at 2:35 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Scott Marlowe wrote:
>> Actually, it's usually the drives that lie about fsync, especially
>> consumer grade (and some server grade) SATA / PATA drives are known
>> for this.
>
> I'm still looking for any evidence of any drive that lies.

I've tested drives in the past (when PATA roamed the land, and 80Gig
drives were huge) and could most certainly induce db corruption by
starting pgbench with -c 100, forcing a checkpoint, and then pulling
the plug.  Turning off the write cache (ok, all cache I guess) on those
drives made them capable of surviving the power plug being pulled.

Re: Excessive (and slow) fsync() within single transaction

From:
Stephen Tyler
Date:

On Thu, Dec 10, 2009 at 8:38 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> PostgreSQL on OS X uses a special fsync method by default named 'fsync_writethrough'.  If you run this on your database you should be able to confirm it's using it:
>
> show wal_sync_method;
>
> This method includes a call to Apple's API to make sure the cache is properly flushed on the drives.  Lots more details about how all this works at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
 

 # show wal_sync_method;
 wal_sync_method
-----------------
 open_datasync

I don't think I have done anything to change this method from the default compile settings.

From the documentation:
http://www.postgresql.org/docs/8.4/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

· open_datasync (write WAL files with open() option O_DSYNC)
· fdatasync (call fdatasync() at each commit)
· fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)
· fsync (call fsync() at each commit)
· open_sync (write WAL files with open() option O_SYNC)

Not all of these choices are available on all platforms.  The default is the first method in the above list that is supported by the platform.

So apparently, my system supports open_datasync, and since it is the first method, that is the default.
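
If I want to try the fsync_writethrough method Greg mentioned, presumably something like this would do it (untested sketch; as far as I can tell wal_sync_method only needs a reload, not a restart):

-- after setting wal_sync_method = fsync_writethrough in postgresql.conf:
select pg_reload_conf();
show wal_sync_method;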


Stephen


Re: Excessive (and slow) fsync() within single transaction

From:
Vick Khera
Date:
On Wed, Dec 9, 2009 at 6:24 AM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
> Thank you for your suggestions.  I will turn on log_checkpoint next time I
> get the chance to restart.
>


You should be able to set this via a sighup to the server.  No restart required.
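
For example (a sketch, assuming log_checkpoints = on has already been added to postgresql.conf):

select pg_reload_conf();   -- same effect as sending SIGHUP to the postmaster
show log_checkpoints;      -- confirm the new value took effect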

Re: Excessive (and slow) fsync() within single transaction

From:
Stephen Tyler
Date:
On Wed, Dec 9, 2009 at 12:57 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> You should turn on log_checkpoints in the postgresql.conf and confirm the slowdowns are happening around the same time as the checkpoint report gets written to the log files.

I have turned on log_checkpoints, and re-run the command.  Checkpoints are being written every 220 to 360 seconds.  About 80% are "time" and 20% "xlog".  Here is a representative sample:

10/12/09 5:12:59 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
10/12/09 5:16:41 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 133795 buffers (25.5%); 0 transaction log file(s) added, 0 removed, 98 recycled; write=112.281 s, sync=108.809 s, total=221.166 s
10/12/09 5:17:32 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: xlog
10/12/09 5:23:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 292601 buffers (55.8%); 0 transaction log file(s) added, 0 removed, 76 recycled; write=162.919 s, sync=209.277 s, total=372.229 s
10/12/09 5:23:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
10/12/09 5:27:58 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 113330 buffers (21.6%); 0 transaction log file(s) added, 0 removed, 128 recycled; write=165.252 s, sync=87.588 s, total=253.285 s
10/12/09 5:28:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
10/12/09 5:34:11 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 206947 buffers (39.5%); 0 transaction log file(s) added, 0 removed, 65 recycled; write=152.953 s, sync=172.978 s, total=326.460 s

So a "typical" checkpoint is around 200K buffers (1.5GBytes, 40% of max), taking around 150 seconds to write (10MByte/second, 1300 buffers/second), and around 150 seconds to sync.

The system pauses/freezes are much more frequent.  Roughly 20 to 60 seconds long, interspersed with a brief pulse of activity.  They align with the lengthy fsync calls reported by DTrace.  But there are many more fsync calls being made.  Most fsync calls return in a couple of milliseconds or microseconds.

As I have previously mentioned, my system is using the default "open_datasync" WAL sync method.  The test_fsync utility reports a small "open o_dsync, write" call returning in around 300 microseconds or less, both on the SSD RAID 0 array and a 7200rpm SATA HDD.  If that utility is to be believed, "open o_dsync, write" is not reaching the disk platters by the time it returns.

Why then does it take postgresql 200 seconds to sync a checkpoint (or DTrace report 20 to 60 second fsync calls)?  The drives themselves have only smallish 32MB buffers.  I can write a 1.5GByte file (the size of the average checkpoint) in only 10 seconds, if I do it from outside postgresql.

Stephen

Re: Excessive (and slow) fsync() within single transaction

From:
Greg Smith
Date:
Stephen Tyler wrote:
> So a "typical" checkpoint is around 200K buffers (1.5GBytes, 40% of
> max), taking around 150 seconds to write (10MByte/second, 1300
> buffers/second), and around 150 seconds to sync.
Your problem may very well be plain old lack of disk I/O throughput, and
fsync forcing some caches to clear is just a symptom rather than the
root problem.  1.5GB per checkpoint spread out over ~5 minutes is still
a pretty high load of writes even for an SSD to deal with, given that
it's got to compete with foreground activity too.  I try to aim for less
than 250MB of checkpoint writes per 5 minute period with regular disks,
so even with your theoretically good random I/O I think you're having
checkpoints at least twice as often as your system can tolerate them.
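
A rough way to put a number on that from the stats you already posted (sketch; assumes 8kB blocks and at least one checkpoint since the stats were reset):

select round(buffers_checkpoint * 8 / 1024.0
             / (checkpoints_timed + checkpoints_req)) as avg_mb_per_checkpoint
  from pg_stat_bgwriter;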

> Why then does it take postgresql 200 seconds to sync a checkpoint (or
> DTrace report 20 to 60 second fsync calls)?  The drives themselves
> have only smallish 32MB buffers.  I can write a 1.5GByte file (the
> size of the average checkpoint) in only 10 seconds, if I do it from
> outside postgresql.
The information being written out to the database has a large random I/O
component to it, so you can't compare it to sequential write speed at
all.  You said you can get 4K random writes/second, right?  That might
only be as little as 4K IOPS * 8K block = 32MB/s worth of write speed.
At that speed, 1.5GB will take over 45 seconds, not 10.

I don't really have any great suggestions for you from here.  You're
using not very well understood hardware on the 2nd worst of the popular
platforms for PostgreSQL from a performance perspective (with Windows by
far the worst), trying to push through what would be a stressful
workload even on the best of them.  I'm not too familiar with this area
because recommendation #1 if I ran into this situation would be "Don't
try that on OS X with HFS+".  Maybe there's some way to get more
performance out of there by tweaking the OS, I haven't had to do so
myself enough to know the details off the top of my head.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Excessive (and slow) fsync() within single transaction

From:
Greg Smith
Date:
Stephen Tyler wrote:
> I don't understand how maxwritten_clean could be as high as 95058, and increment at more than 1 per second.  This is a process count, not a buffer count?  How often is the background cleaner launched?  Does that mean I need to massively increase bgwriter_lru_maxpages, and other bgwriter params?  They are currently default values.
maxwritten_clean is a straight count of the number of times that event happened, not a buffer count.
The background writer runs at whatever frequency bgwriter_delay is set to, which defaults to 200ms for 5 executions/second.  You could increase bgwriter_lru_maxpages and the rest, but those actually make the system less efficient if you're having trouble just keeping up with checkpoint I/O.  They're more aimed at improving latency on systems where there's enough I/O to spare that you can write ahead a bit even if it costs you an occasional penalty.  In your situation, I'd set bgwriter_lru_maxpages=0 and just get rid of it altogether--it's really unlikely it's helping, and it might be making things a tiny bit worse.

> checkpoint_segments = 128
> checkpoint_segments is large, but in the suggested range of 32 to 256 for large, write-heavy, batch-oriented systems.  I've used smaller values in the past, but I got lots of checkpoint warnings in the console.
I think that given your backing disk situation and the situation you're in, you could easily justify 256 on your system.

> wal_buffers might be a bit high.  I don't fully understand the implications, and how to tune this value, even after reading the tuning guides.
It's not documented very well, that's for sure.  Setting it too high won't hurt you, just takes up a tiny amount of RAM.  I've been talking recently with people about increasing the standard recommendation for that to 16MB.

>> *) disable fsync...risky, but I bet it will make the problem go away
>
> I'm very tempted to live dangerously.  I would like to actually turn off fsync's during a transaction, and just use the fsync at the very end of the transaction to confirm that the transaction will be permanent.
The only other option that might help you out here is to turn off synchronous_commit for either the whole system, or just for transactions where durability isn't that important.  That introduces a small risk of data loss, but not the risk of corruption that turning fsync off altogether does.  Basically, it reduces the number of fsync's from transaction commits to be a fixed number per unit of time, rather than being proportional to the number of commits.
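
For the per-transaction variant, a sketch using the tables from your example:

begin;
set local synchronous_commit to off;  -- only this transaction skips the commit-time WAL flush
update link_relurl as u set sid = m.sid from link_meta as m where u.link = m.link;
commit;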

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: Excessive (and slow) fsync() within single transaction

From:
Bill Todd
Date:
Greg Smith wrote:
> I don't really have any great suggestions for you from here.  You're
> using not very well understood hardware on the 2nd worst of the
> popular platforms for PostgreSQL from a performance perspective (with
> Windows by far the worst), trying to push through what would be a
> stressful workload even on the best of them.
>
What are the two best platforms for PostgreSQL in your opinion?

Bill

Re: Excessive (and slow) fsync() within single transaction

From:
Greg Smith
Date:
Bill Todd wrote:
> What are the two best platforms for PostgreSQL in your opinion?
There are three good platforms, each with a different set of strengths
and weaknesses:

-Linux:  Lots of driver support even for very recent things available.
Really broad universe of available add-on packages.  Performance is good
but a little "choppy"--it's not impossible to find situations where it
degrades in somewhat odd ways.  Stability is quite variable too; you
need to carefully stage your updates.

-FreeBSD:  More stable and predictable performance than Linux, which in
some cases even turns out to be better overall.  But driver support lags
behind, and in general you have to be more careful about what hardware
you match it with.  Much smaller user community when you do run into a
strange performance issue too.

-Solaris:  Performance is quite steady and predictable, but sometimes
not quite as fast as FreeBSD/Linux.  Driver support isn't impressive,
but it tends to cover the things you need for a boring database server
well enough (and the drivers don't break at random after kernel upgrades
like on Linux), and it's usually quite clear what will and won't work.
Has great, complete implementations of some high-end features
particularly well suited for larger systems like ZFS, DTrace (the
FreeBSD versions of those aren't quite as good yet).  Software package
availability is still quite weak compared with the alternatives, and
even after spending a fair amount of time with Solaris over the years I
still find it much harder to manage too.

I was picking on Mac OS X for a couple of reasons:

-Worst driver support of anything--very few viable options for RAID
controllers for example

-Kernel tuned for GUI apps rather than server ones.  I haven't seen much
on this recently, but older comparisons made the Darwin kernel look
extremely bad compared with Linux on process creation and complicated
memory allocation work in particular--stuff you need in a database but
don't really care about in most user apps.  I think people stopped even
running these tests because it made OS X look so bad.

-Traditional UNIX low-level tools and approaches are either weak or not
available.  I spent a fair amount of time looking around and I still
can't tell you how to tweak the write cache choices on OS X.  The mount
options I almost always tweak on other platforms don't even exist on
HFS+, you can't work around the problems with the friendly defaults.
Apple's vm_stat, iostat, and sar are pretty poor too.  Sure, you get
possibly the best DTrace analysis tool there is, and a nice GUI
monitoring app too; great.  That's only helpful if you get the basics
right first.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Excessive (and slow) fsync() within single transaction

From:
Stephen Tyler
Date:
On Thu, Dec 10, 2009 at 9:21 PM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
> I have turned on log_checkpoints, and re-run the command.  Checkpoints are being written every 220 to 360 seconds.  About 80% are "time" and 20% "xlog".  Here is a representative sample:
>
> 10/12/09 5:12:59 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
> 10/12/09 5:16:41 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 133795 buffers (25.5%); 0 transaction log file(s) added, 0 removed, 98 recycled; write=112.281 s, sync=108.809 s, total=221.166 s
> 10/12/09 5:17:32 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: xlog
> 10/12/09 5:23:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 292601 buffers (55.8%); 0 transaction log file(s) added, 0 removed, 76 recycled; write=162.919 s, sync=209.277 s, total=372.229 s
> 10/12/09 5:23:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
> 10/12/09 5:27:58 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 113330 buffers (21.6%); 0 transaction log file(s) added, 0 removed, 128 recycled; write=165.252 s, sync=87.588 s, total=253.285 s
> 10/12/09 5:28:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
> 10/12/09 5:34:11 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 206947 buffers (39.5%); 0 transaction log file(s) added, 0 removed, 65 recycled; write=152.953 s, sync=172.978 s, total=326.460 s

The above was on a RAID 0 array of Intel X-25M drives.  Spec'd speed is 500MB/s read, 140MB/s write, 7000 random writes/s.  Observed speed (outside postgres, not a benchmark) is 420MB/s read, 140MB/s write, 4000 random writes/s.

I have now temporarily moved the entire database (and postgres data directories) to a single 7200rpm SATA drive (no RAID, no special drivers - just an ordinary WD Caviar Black 1TB).  Unfortunately the situation appears very similar (and even slower):

11/12/09 12:26:59 PM    org.postgresql.postgres[1652]    LOG:  checkpoint starting: time
11/12/09 12:33:43 PM    org.postgresql.postgres[1652]    LOG:  checkpoint complete: wrote 196843 buffers (37.5%); 0 transaction log file(s) added, 0 removed, 54 recycled; write=201.260 s, sync=202.874 s, total=404.224 s
11/12/09 12:33:43 PM    org.postgresql.postgres[1652]    LOG:  checkpoint starting: time
11/12/09 12:40:15 PM    org.postgresql.postgres[1652]    LOG:  checkpoint complete: wrote 150293 buffers (28.7%); 0 transaction log file(s) added, 0 removed, 81 recycled; write=198.739 s, sync=192.861 s, total=391.682 s
11/12/09 12:40:15 PM    org.postgresql.postgres[1652]    LOG:  checkpoint starting: time
11/12/09 12:46:07 PM    org.postgresql.postgres[1652]    LOG:  checkpoint complete: wrote 136752 buffers (26.1%); 0 transaction log file(s) added, 0 removed, 61 recycled; write=170.165 s, sync=181.921 s, total=352.199 s
11/12/09 12:46:07 PM    org.postgresql.postgres[1652]    LOG:  checkpoint starting: time
11/12/09 12:51:51 PM    org.postgresql.postgres[1652]    LOG:  checkpoint complete: wrote 146884 buffers (28.0%); 0 transaction log file(s) added, 0 removed, 57 recycled; write=187.262 s, sync=156.595 s, total=343.916 s

Some further observations:

1) The situation is a bit different during the first 15 minutes of running the huge query.  There is more CPU and disk activity (a lot of reading and writing - what I would call a busy disk).  But after a while (hours), it settles down to a pattern of almost zero read I/O.  Write I/O is a brief burst (<2 seconds, 2000 writes/sec, 30-50 MB/s) followed by about 30 seconds of 20-60 writes/sec (500KB/s).  CPU is negligible.

2) The HDD temperature is very low - identical to a neighbor that is completely idle.  I can't hear any disc rattling.

3) During these ~30 second activity pauses, access to the HDD from outside of postgres stalls.

4) DTrace is reporting fsync delays of 12 to 72 seconds (30 seconds typical) from calls made by postgres.  About 40% of calls return in < 10 microseconds, 50% return in 30 - 250 ms, 3% are in the seconds, and 2% are in the tens of seconds.

5) Although the checkpoint log reports say 190s of writing and 190s of sync, the activity pattern reported by Activity Monitor is 95% near idle, and 5% activity.

Stephen