Thread: WAL and O_DIRECT

From: Ravi Krishna

As per PG 9.4 documentation:

wal_sync_method (enum)
Method used for forcing WAL updates out to disk. If fsync is off then this setting is irrelevant, since WAL file
updates will not be forced out at all. Possible values are:

open_datasync (write WAL files with open() option O_DSYNC)

fdatasync (call fdatasync() at each commit)

fsync (call fsync() at each commit)

fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)

open_sync (write WAL files with open() option O_SYNC)

The open_* options also use O_DIRECT if available. Not all of these choices are available on all platforms. The default
is the first method in the above list that is supported by the platform, except that fdatasync is the default on Linux.
The default is not necessarily ideal; it might be necessary to change this setting or other aspects of your system
configuration in order to create a crash-safe configuration or achieve optimal performance. These aspects are discussed
in Section 29.1. This parameter can only be set in the postgresql.conf file or on the server command line.
===============

Our wal_sync_method is open_sync.

On RHEL 6.4, strace shows PG opening the WAL files in the following mode:

1431610784.573828 open("pg_xlog/0000000100000047000000A5", O_RDWR|O_DSYNC)

Why is O_DIRECT not used, despite the documentation mentioning otherwise?

On the same host, DB2 opens active logs (WAL) as follows:
open("/bb/db/pgentdb/data001/db2/db2/NODE0000/SQL00001/LOGSTREAM0000/S0000001.LOG", O_RDWR|O_DSYNC|O_DIRECT)

We were expecting the same in PG too.

We are benchmarking PG vs DB2 for a new app.

Thanks.
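
For readers comparing the two strace lines above, a small sketch of how the open() flag sets can be decoded and compared. The helper name describe_open_flags is hypothetical and not part of PostgreSQL or strace; it only inspects the flag constants exposed by Python's os module and assumes a Unix-like platform.

```python
import os

def describe_open_flags(flags):
    """Return the names of the access mode and sync-related flags set in `flags`.

    Hypothetical helper for illustration; only O_SYNC, O_DSYNC and O_DIRECT
    are decoded, other flags are ignored.
    """
    access = flags & (os.O_WRONLY | os.O_RDWR)
    access_names = {os.O_RDONLY: "O_RDONLY", os.O_WRONLY: "O_WRONLY", os.O_RDWR: "O_RDWR"}
    names = [access_names.get(access, "?")]

    # On Linux, O_SYNC contains the O_DSYNC bit, so test for full O_SYNC first.
    o_sync = os.O_SYNC
    o_dsync = getattr(os, "O_DSYNC", 0)
    if flags & o_sync == o_sync:
        names.append("O_SYNC")
    elif o_dsync and flags & o_dsync == o_dsync:
        names.append("O_DSYNC")

    # O_DIRECT is not defined on every platform, so look it up defensively.
    o_direct = getattr(os, "O_DIRECT", 0)
    if o_direct and flags & o_direct:
        names.append("O_DIRECT")
    return names
```

Applied to the flags in the PG trace (O_RDWR|O_DSYNC), this reports no O_DIRECT, while the DB2 trace additionally carries O_DIRECT.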


Re: WAL and O_DIRECT

From: Tom Lane

Ravi Krishna <s.ravikrishna@aim.com> writes:
> Why is O_DIRECT not used, despite the documentation mentioning otherwise?

You've not shown us all your settings, but this comment in xlog.c might
explain it:

     * Optimize writes by bypassing kernel cache with O_DIRECT when using
     * O_SYNC/O_FSYNC and O_DSYNC.  But only if archiving and streaming are
     * disabled, otherwise the archive command or walsender process will read
     * the WAL soon after writing it, which is guaranteed to cause a physical
     * read if we bypassed the kernel cache. We also skip the
     * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
     * reason.

            regards, tom lane
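
The rule described in that source comment can be sketched roughly as follows. This is an illustration only, not the PostgreSQL source: the function wal_open_flags and its parameters fsync_enabled and wal_is_read_back are hypothetical names, and the mapping of method names to flags follows the documentation quoted at the top of the thread.

```python
import os

def wal_open_flags(sync_method, fsync_enabled=True, wal_is_read_back=False):
    """Return the extra open() flags a WAL writer might use (illustrative only).

    wal_is_read_back stands for "archiving or streaming is enabled", i.e.
    another process will read the WAL soon after it is written.
    """
    if not fsync_enabled:
        # With fsync off, nothing is forced out, so no sync flags apply.
        return 0

    # Bypass the kernel cache only when nobody will read the WAL back soon.
    # O_DIRECT is looked up defensively because not every platform defines it.
    direct = 0 if wal_is_read_back else getattr(os, "O_DIRECT", 0)

    if sync_method == "open_sync":
        return os.O_SYNC | direct
    if sync_method == "open_datasync":
        return getattr(os, "O_DSYNC", os.O_SYNC) | direct
    if sync_method in ("fsync", "fdatasync", "fsync_writethrough"):
        # These methods flush with a separate call after write(), so the
        # file is opened without sync flags.
        return 0
    raise ValueError("unknown wal_sync_method: %r" % (sync_method,))
```

Under these assumptions, a server with streaming replication enabled would get a sync flag without O_DIRECT, which is consistent with the strace output in the original question.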


Re: WAL and O_DIRECT

From: Ravi Krishna

Aha, that pretty much explains it. Yes, we are using streaming replication.

However, our DB2 folks are raising a concern that PG WAL writes may not be crash safe unless we are using write-back technology in SAN or SSD, which we are.


Re: WAL and O_DIRECT

From: Bruce Momjian

On Thu, May 14, 2015 at 11:32:12AM -0400, Ravi Krishna wrote:
> Aha that pretty much explains it. Yes we are using streaming replication.
>
> However our DB2 folks are raising a concern that PG WAL writes may not be crash
> safe, unless we are using write back technology in SAN or SSD, which we are
> using.

We write WAL to the kernel, then issue fsync.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Re: WAL and O_DIRECT

From: Tom Lane

Ravi Krishna <s.ravikrishna@aim.com> writes:
> However our DB2 folks are raising a concern that PG WAL writes may not be crash safe, unless we are using write back
> technology in SAN or SSD, which we are using.

What's your point exactly?  If the underlying hardware does not provide
durable writes, there's nothing PG (or DB2) can do to fix that.

            regards, tom lane


Re: WAL and O_DIRECT

From: Ravi Krishna

>> However our DB2 folks are raising a concern that PG WAL writes may not be crash safe, unless we are using
>> write back technology in SAN or SSD, which we are using.

> What's your point exactly?  If the underlying hardware does not provide durable writes, there's
> nothing PG (or DB2) can do to fix that.

Am I right in concluding that PG WAL writes without underlying h/w caching are not crash proof?
Fortunately, these days caching is ubiquitous in SSD/SAN technology. Both Oracle and DB2 always open WAL
logs with O_DIRECT. Is this thinking outdated given modern technology that caches writes? I wonder why Oracle/DB2
are not making O_DIRECT optional. I am sure it would increase write performance.


Re: WAL and O_DIRECT

From: Tom Lane

Ravi Krishna <s.ravikrishna@aim.com> writes:
> Am I right in concluding that PG WAL writes without underlying h/w caching are not crash proof?

We expect that the hardware+OS honors fsync() correctly, ie puts the data
on durable storage before fsync returns.  That does not mean anything one
way or the other about caching, it just means that the filesystem stack
has to be correctly implemented.

            regards, tom lane


Re: WAL and O_DIRECT

From: Bruce Momjian

On Thu, May 14, 2015 at 12:07:04PM -0400, Ravi Krishna wrote:
> >> However our DB2 folks are raising a concern that PG WAL writes may not be crash safe, unless we are using
> >> write back technology in SAN or SSD, which we are using.
>
> > What's your point exactly?  If the underlying hardware does not provide durable writes, there's
> > nothing PG (or DB2) can do to fix that.
>
>
> Am I right in concluding that PG WAL writes without underlying h/w caching are not crash proof?
> Fortunately, these days caching is ubiquitous in SSD/SAN technology. Both Oracle and DB2 always open WAL
> logs with O_DIRECT. Is this thinking outdated given modern technology that caches writes? I wonder why Oracle/DB2
> are not making O_DIRECT optional. I am sure it would increase write performance.

Basically, O_DIRECT writes through the OS cache directly to the storage.
Postgres writes to the OS cache, then uses fsync() or another OS call to
flush that OS write to the storage --- we just do it in two parts.

We turn off O_DIRECT for WAL writes because we know another process is
going to read it soon, so in that case, we fall back to the two-part
solution of OS write and fsync-like OS call.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
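
The two approaches contrasted above can be sketched as follows. This is a minimal illustration, not PostgreSQL code: the function names write_then_fsync and write_with_dsync are hypothetical, and O_DIRECT is deliberately left out because it imposes buffer-alignment requirements that would distract from the point. In both variants, durability still depends on the OS and hardware honoring the flush, as noted earlier in the thread.

```python
import os
import tempfile

def write_then_fsync(path, record):
    """Two-part approach: write into the OS cache, then flush it with fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, record)  # the data may so far exist only in the kernel cache
        os.fsync(fd)          # ask the OS to push the data to storage before returning
    finally:
        os.close(fd)

def write_with_dsync(path, record):
    """One-part approach: O_DSYNC makes each write() wait for the data to be flushed."""
    dsync = getattr(os, "O_DSYNC", os.O_SYNC)  # fall back to O_SYNC if O_DSYNC is missing
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | dsync, 0o600)
    try:
        os.write(fd, record)  # no separate flush call is needed here
    finally:
        os.close(fd)

def read_back(path):
    """Read the file through the normal (cached) path, as a WAL reader would."""
    with open(path, "rb") as f:
        return f.read()
```

Both functions leave the same bytes in the file; the difference is only in when the flush happens and whether the written data remains in the OS cache for a subsequent reader such as the archiver or walsender.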


Re: WAL and O_DIRECT

From: Ravi Krishna

Thanks Bruce and Tom. That pretty much explains it.