Обсуждение: WAL sync behaviour

Поиск
Список
Период
Сортировка

WAL sync behaviour

От
"Steinar H. Gunderson"
Дата:
Hi,

We're having problems with our PostgreSQL server using forever for simple
queries, even when there's little load -- or rather, the transactions seem
to take forever to commit. We're using 8.1 (yay!) on a single Opteron, with
WAL on the system two-disk (software) RAID-1, separate from the database
four-disk RAID-10. All drives are 10000rpm SCSI disks, with write cache
turned off; we value our data :-) We're running Linux 2.6.13.4, with 64-bit
kernel but 32-bit userspace.

The main oddity is that simple transactions take forever to execute, even on
small tables with no triggers. A COMMIT on an otherwise idle system with one
row to commit can take anything from 60-200ms to execute, which seems quite
excessive -- sometimes (and I've verified that there's not a checkpoint or
vacuum going on at that time), transactions seem to pile up and you get
behaviour like:

LOG:  duration: 836.004 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.100', hostname =
'mivu-03.samfundet.no'WHERE posid = 'mivu-03' 
LOG:  duration: 753.545 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip =  '129.241.93.110', hostname =
'mivu-13.samfundet.no'WHERE posid = 'mivu-13' 
LOG:  duration: 567.914 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.109', hostname =
'mivu-12.samfundet.no'WHERE posid = 'mivu-12' 
LOG:  duration: 515.013 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.105', hostname =
'mivu-08.samfundet.no'WHERE posid = 'mivu-08' 
LOG:  duration: 427.541 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.104', hostname =
'mivu-07.samfundet.no'WHERE posid = 'mivu-07' 
LOG:  duration: 383.314 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.107', hostname =
'mivu-10.samfundet.no'WHERE posid = 'mivu-10' 
LOG:  duration: 348.965 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.103', hostname =
'mivu-06.samfundet.no'WHERE posid = 'mivu-06' 
LOG:  duration: 314.465 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.101', hostname =
'mivu-04.samfundet.no'WHERE posid = 'mivu-04' 
LOG:  duration: 824.893 ms  statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.106', hostname =
'mivu-09.samfundet.no'WHERE posid = 'mivu-09' 

Sometimes, six or seven of these transactions even seem to wait for the same
thing, reporting finishing times of something like 6, 5, 4, 3 and 2 seconds
right after each other in the log! This is not a highly loaded system, so I
don't really see why this should happen. (We had the same problems with 7.4,
but if my imagination isn't playing games on me, they seem to have become
slightly worse with 8.1.)

strace shows that fdatasync() takes almost all that time, but when I run my own
fdatasync() test program on the same file system, I can consistently sync a
file (after an 8kB write) in about 30ms every time, so I don't really know why
this would be so much slower with PostgreSQL. We're using the cfq scheduler,
but deadline and noop give about the same results.

Setting wal_sync_method = open_sync seems to improve the situation
dramatically on simple commits; we get down into the 10-30ms range on an idle
system. OTOH, behaviour seems to get slightly worse when there's more stuff
going on, and we still get the 300ms transactions in batches every now and
then.

Any good ideas?

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: WAL sync behaviour

От
Richard Huxton
Дата:
Steinar H. Gunderson wrote:
> Hi,
>
> We're having problems with our PostgreSQL server using forever for simple
> queries, even when there's little load -- or rather, the transactions seem
> to take forever to commit. We're using 8.1 (yay!) on a single Opteron, with
> WAL on the system two-disk (software) RAID-1, separate from the database
> four-disk RAID-10. All drives are 10000rpm SCSI disks, with write cache
> turned off; we value our data :-) We're running Linux 2.6.13.4, with 64-bit
> kernel but 32-bit userspace.

You're beyond my area of expertise, but I do know that someone's going
to ask what filesystem this is (ext2/xfs/etc). And probably to see the
strace too.

Hmm - the only things I can think to check:
Do vmstat/iostat show any unusual activity?
Are your system logs on these disks too?
Could it be the journalling on the fs clashing with the WAL?

--
   Richard Huxton
   Archonet Ltd

Re: WAL sync behaviour

От
"Steinar H. Gunderson"
Дата:
On Thu, Nov 10, 2005 at 02:14:30PM +0000, Richard Huxton wrote:
> You're beyond my area of expertise, but I do know that someone's going
> to ask what filesystem this is (ext2/xfs/etc).

Ah, yes, I forgot -- it's ext3. We're considering simply moving the WAL onto
a separate partition (with data=writeback and noatime) if that can help us
any.

> And probably to see the strace too.

The strace with wal_sync_method = fdatasync goes like this (I attach just
before I do the commit):

cirkus:~> sudo strace -T -p 15718
Process 15718 attached - interrupt to quit
read(8, "\27\3\1\0 ", 5)                = 5 <2.635856>
read(8, "\336\333\24KB\325Ga\324\264[\307v\254h\254\350\20\220a"..., 32) = 32 <0.000031>
read(8, "\27\3\1\0000", 5)              = 5 <0.000027>
read(8, "$E\217<z\350bI\2217\317\3662\301\273\233\17\177\256\225"..., 48) = 48 <0.000026>
send(7, "\3\0\0\0\30\0\0\0\20\0\0\0f=\0\0commit;\0", 24, 0) = 24 <0.000071>
gettimeofday({1131632603, 187599}, NULL) = 0 <0.000026>
time(NULL)                              = 1131632603 <0.000027>
open("pg_xlog/0000000100000000000000A2", O_RDWR|O_LARGEFILE) = 14 <0.000039>
_llseek(14, 12500992, [12500992], SEEK_SET) = 0 <0.000026>
write(14, "]\320\1\0\1\0\0\0\0\0\0\0\0\300\276\242\362\0\0\0\31\0"..., 8192) = 8192 <0.000057>
fdatasync(14)                           = 0 <0.260194>
gettimeofday({1131632603, 448459}, NULL) = 0 <0.000034>
time(NULL)                              = 1131632603 <0.000027>
time([1131632603])                      = 1131632603 <0.000025>
getpid()                                = 15718 <0.000025>
rt_sigaction(SIGPIPE, {0x558a27e0, [], 0}, {SIG_IGN}, 8) = 0 <0.000029>
send(3, "<134>Nov 10 15:23:23 postgres[15"..., 121, 0) = 121 <0.000032>
rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0 <0.000029>
send(7, "\4\0\0\0\330\3\0\0\20\0\0\0f=\0\0\247@\0\0\16\0\0\0\1\0"..., 984, 0) = 984 <0.000076>
send(7, "\4\0\0\0\330\3\0\0\20\0\0\0f=\0\0\247@\0\0\16\0\0\0\0\0"..., 984, 0) = 984 <0.000051>
send(7, "\4\0\0\0\330\3\0\0\20\0\0\0f=\0\0\247@\0\0\16\0\0\0\0\0"..., 984, 0) = 984 <0.000050>
send(7, "\4\0\0\0\250\0\0\0\20\0\0\0f=\0\0\247@\0\0\2\0\0\0\0\0"..., 168, 0) = 168 <0.000050>
send(7, "\4\0\0\0\250\0\0\0\20\0\0\0f=\0\0\0\0\0\0\2\0\0\0\0\0\0"..., 168, 0) = 168 <0.000049>
send(7, "\3\0\0\0\27\0\0\0\20\0\0\0f=\0\0<IDLE>\0", 23, 0) = 23 <0.000047>
write(8, "\27\3\1\0 B\260\253rq)\232\265o\225\272\235\v\375\31\323"..., 90) = 90 <0.000229>
read(8,  <unfinished ...>
Process 15718 detached

> Do vmstat/iostat show any unusual activity?

No, there's not much activity. In fact, it's close to idle.

> Are your system logs on these disks too?

Yes, they are, but nothing much is logged, really -- and sync is off for most
of the logs in syslogd.

> Could it be the journalling on the fs clashing with the WAL?

Unsure -- that's what I was hoping to get some information on :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: WAL sync behaviour

От
Michael Stone
Дата:
On Thu, Nov 10, 2005 at 03:25:35PM +0100, Steinar H. Gunderson wrote:
>Ah, yes, I forgot -- it's ext3. We're considering simply moving the WAL onto
>a separate partition (with data=writeback and noatime) if that can help us
>any.

There's no reason to use a journaled filesystem for the wal. Use ext2 in
preference to ext3.

Mike Stone

Re: WAL sync behaviour

От
Scott Marlowe
Дата:
On Thu, 2005-11-10 at 08:43, Michael Stone wrote:
> On Thu, Nov 10, 2005 at 03:25:35PM +0100, Steinar H. Gunderson wrote:
> >Ah, yes, I forgot -- it's ext3. We're considering simply moving the WAL onto
> >a separate partition (with data=writeback and noatime) if that can help us
> >any.
>
> There's no reason to use a journaled filesystem for the wal. Use ext2 in
> preference to ext3.

Not from what I understood.  Ext2 can't guarantee that your data will
even be there in any form after a crash.  I believe only metadata
journaling is needed though.

Re: WAL sync behaviour

От
Michael Stone
Дата:
On Thu, Nov 10, 2005 at 09:52:38AM -0600, Scott Marlowe wrote:
>Not from what I understood.  Ext2 can't guarantee that your data will
>even be there in any form after a crash.

It can if you sync the data. (Which is the whole point of the WAL.)

>I believe only metadata journaling is needed though.

If you don't sync, metadata journaling doesn't do anything to guarantee
that your data will be there, so you're adding no data security in the
non-synchronous-write case. (Which is irrelevant for the WAL.)

What metadata journalling gets you is fast recovery from crashes by
avoiding a fsck. The fsck time is related to the number of files on a
filesystem--so it's generally pretty quick on a WAL partition anyway.

Mike Stone

Re: WAL sync behaviour

От
Tom Lane
Дата:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> On Thu, 2005-11-10 at 08:43, Michael Stone wrote:
>> There's no reason to use a journaled filesystem for the wal. Use ext2 in
>> preference to ext3.

> Not from what I understood.  Ext2 can't guarantee that your data will
> even be there in any form after a crash.  I believe only metadata
> journaling is needed though.

No, Mike is right: for WAL you shouldn't need any journaling.  This is
because we zero out *and fsync* an entire WAL file before we ever
consider putting live WAL data in it.  During live use of a WAL file,
its metadata is not changing.  As long as the filesystem follows
the minimal rule of syncing metadata about a file when it fsyncs the
file, all the live WAL files should survive crashes OK.

We can afford to do this mainly because WAL files can normally be
recycled instead of created afresh, so the zero-out overhead doesn't
get paid during normal operation.

You do need metadata journaling for all non-WAL PG files, since we don't
fsync them every time we extend them; which means the filesystem could
lose track of which disk blocks belong to such a file, if it's not
journaled.

            regards, tom lane

Re: WAL sync behaviour

От
Scott Marlowe
Дата:
On Thu, 2005-11-10 at 10:39, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > On Thu, 2005-11-10 at 08:43, Michael Stone wrote:
> >> There's no reason to use a journaled filesystem for the wal. Use ext2 in
> >> preference to ext3.
>
> > Not from what I understood.  Ext2 can't guarantee that your data will
> > even be there in any form after a crash.  I believe only metadata
> > journaling is needed though.
>
> No, Mike is right: for WAL you shouldn't need any journaling.  This is
> because we zero out *and fsync* an entire WAL file before we ever
> consider putting live WAL data in it.  During live use of a WAL file,
> its metadata is not changing.  As long as the filesystem follows
> the minimal rule of syncing metadata about a file when it fsyncs the
> file, all the live WAL files should survive crashes OK.
>
> We can afford to do this mainly because WAL files can normally be
> recycled instead of created afresh, so the zero-out overhead doesn't
> get paid during normal operation.
>
> You do need metadata journaling for all non-WAL PG files, since we don't
> fsync them every time we extend them; which means the filesystem could
> lose track of which disk blocks belong to such a file, if it's not
> journaled.

Thanks for the clarification!  Nice to know I can setup an ext2
partition for my WAL files then.  Is this in the docs anywhere?

Re: WAL sync behaviour

От
mark@mark.mielke.cc
Дата:
On Thu, Nov 10, 2005 at 11:39:34AM -0500, Tom Lane wrote:
> No, Mike is right: for WAL you shouldn't need any journaling.  This is
> because we zero out *and fsync* an entire WAL file before we ever
> consider putting live WAL data in it.  During live use of a WAL file,
> its metadata is not changing.  As long as the filesystem follows
> the minimal rule of syncing metadata about a file when it fsyncs the
> file, all the live WAL files should survive crashes OK.

Yes, with emphasis on the zero out... :-)

> You do need metadata journaling for all non-WAL PG files, since we don't
> fsync them every time we extend them; which means the filesystem could
> lose track of which disk blocks belong to such a file, if it's not
> journaled.

I think there may be theoretical problems with regard to the ordering
of the fsync operation, for files that are not pre-allocated. For
example, if a new block is allocated - there are two blocks that need
to be updated.  The indirect reference block (or inode block, if block
references fit into the inode entry), and the block itself. If the
indirect reference block is written first, before the data block, the
state of the disk is inconsistent. This would be a crash during the
fsync() operation. The metadata journalling can ensure that the data
block is allocated first, and then all the necessary references
updated, allowing for the operation to be incomplete and rolled back,
or committed in full.

Or, that is my understanding, anyways, and this is why I would not use
ext2 for the database, even if it was claimed that fsync() was used.

For WAL, with pre-allocated zero blocks? Sure. Ext2... :-)

mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


Re: WAL sync behaviour

От
Tom Lane
Дата:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> Thanks for the clarification!  Nice to know I can setup an ext2
> partition for my WAL files then.  Is this in the docs anywhere?

Don't think so ... want to write something up?  Hard part is to
figure out where to put it ...

            regards, tom lane

Re: WAL sync behaviour

От
"Steinar H. Gunderson"
Дата:
On Thu, Nov 10, 2005 at 12:44:15PM -0500, Tom Lane wrote:
> Don't think so ... want to write something up?  Hard part is to
> figure out where to put it ...

To be honest, I think we could use a "newbie's guide to PostgreSQL
performance tuning". I've seen rather good guides for query tuning, and
guides for general performance tuning, but none that really cover both in a
coherent way. (Also, many of the ones I've seen start getting rather dated;
after bitmap index scans arrived, for instance, many of the rules with regard
to index planning probably changed.)

I'd guess http://www.powerpostgresql.com/PerfList is a rather good start for
the second part (and it's AFAICS under a free license); having something like
that in the docs (or some other document) would probably be a good start.

/* Steinar */
--
Homepage: http://www.sesse.net/