Обсуждение: FAQ Q

Поиск
Список
Период
Сортировка

FAQ Q

От
Tim Ellis
Дата:
From the FAQ:

---------------------------------------------------------------------
PostgreSQL runs in two modes. Normal fsync mode flushes every completed
transaction to disk, guaranteeing that if the OS crashes or loses power in
the next few seconds, all your data is safely stored on disk. In this
mode, we are slower than most commercial databases, partly because few of
them do such conservative flushing to disk in their default modes. In
no-fsync mode, we are usually faster than commercial databases, though in
this mode, an OS crash could cause data corruption. We are working to
provide an intermediate mode that suffers less performance overhead than
full fsync mode, and will allow data integrity within 30 seconds of an OS
crash.
---------------------------------------------------------------------

Has this changed since the FAQ?

I understand that Oracle & Sybase (at least) have guaranteed data
consistency in case of a crash. From this FAQ, there is a suggestion that
at least some commercial RDBMS doesn't even have guaranteed consistency?

In Sybase/Oracle this guaranteed consistency + performance is implemented
by ensuring that the transaction is guaranteed to be flushed to disk
before the data page is flushed, but that neither is guaranteed to be
flushed at all unless your application does a specific COMMIT.

Therefore, even though you might lose some transactions, you will never
lose internal consistency when using transactions. Also, you will not
receive a result from COMMIT until at least the transaction is flushed.

It would seem the FAQ is saying we always flush every transaction all the
time in fsync mode, and in no-fsync mode we don't really guarantee that
the transaction will be flushed before the data page is flushed.

Am I reading this correctly?

(in summation...) So if data consistency is the most important thing to
me, performance be damned, I still, as of 7.2.1, want to run in fsync
mode?

Thanks,
--
Tim Ellis
DBA, Gamet

Re: FAQ Q

От
Tom Lane
Дата:
Tim Ellis <Tim.Ellis@gamet.com> writes:
> (in summation...) So if data consistency is the most important thing to
> me, performance be damned, I still, as of 7.2.1, want to run in fsync
> mode?

Yup.  no-fsync is only suitable if you trust your OS and power supply.

You do not have to trust Postgres itself: in all cases we push the log
entries out to the OS before declaring a transaction committed.  The
question at hand is whether we use fsync() or other methods to try to
force the OS to write to disk before we report the transaction
committed.  Without that, a system-level crash immediately after a
commit report might mean the "committed" transaction isn't reflected as
committed on disk.

The performance cost of fsync is not nearly what it used to be pre-7.1,
btw, because we need flush only the WAL log file not data files.
(In case of a crash, any missing data-file updates will be reconstructed
from the recent WAL entries during restart.)  So the needed disk head
movement is a lot less than it used to be --- especially if you've
arranged for pg_xlog to live on its own disk.  I don't think there's any
longer a good argument for considering no-fsync in production servers,
even though you might well still choose it for development systems.

            regards, tom lane

Re: FAQ Q

От
Tim Ellis
Дата:
On Mon, 10 Jun 2002 17:16:23 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> You do not have to trust Postgres itself: in all cases we push the log
> entries out to the OS before declaring a transaction committed.

In all cases while fsync mode is enabled, not in all cases (including
no-fsync) right?

> question at hand is whether we use fsync() or other methods to try to
> force the OS to write to disk before we report the transaction
> committed.  Without that, a system-level crash immediately after a
> commit report might mean the "committed" transaction isn't reflected as
> committed on disk.

So no-fsync simply means that "commit" != "sync'd to disk"? How is this
different than the big boys do it? From Sybase internals classes and
Oracle documentation, I know that when you say "COMMIT" you don't get any
response back until the RDBMS has written that transaction TO DISK.

Is the FAQ being too generous in speed and too unfair in recoverability to
the commercial offerings here?

From the FAQ: "Normal fsync mode flushes every completed transaction to
disk, guaranteeing that if the OS crashes or loses power in the next few
seconds, all your data is safely stored on disk. In this mode, we are
slower than most commercial databases, partly because few of them do such
conservative flushing to disk in their default modes."

Seems we're claiming that most commercial databases don't actually flush
anything to disk before returning a result to a COMMIT, and at the same
time that therefore they're faster.

My training says Sybase/DB2/Oracle/Informix won't be any less conservative
and at the same time, won't be any faster.

Sybase at least just tries to avoid writing tranlogs to disk except in
batches, so when you say COMMIT, it pauses until X amount of time passes
or Y number of committed transactions have piled up, then it writes those
trans to disk. But your client is blocked on the COMMIT until the
transaction actually makes it to disk.

Perhaps we are including in "most commercial databases" several RDBMSs not
Sybase/Oracle/DB2/Informix?

> The performance cost of fsync is not nearly what it used to be pre-7.1,
> btw, because we need flush only the WAL log file not data files.

Oh, so here's where my confusion continues. The fsync pre-7.1 was flushing
data pages, whereas 7.1/7.1 only flushes WAL? So basically the FAQ quotes
above and previously apply only to pre-7.1, and not 7.1/7.2?

> (In case of a crash, any missing data-file updates will be reconstructed
> from the recent WAL entries during restart.)

This is what I'd thought the FAQ was referring to when it said "all your
data is safely stored on disk." So am I right in saying:

In 7.1/7.2, when you COMMIT, you are guaranteed that WAL (and probably
nothing but WAL) is written to disk before you get a return and that
therefore if your instance immediately thereafter crashes, that after
recovery (which is required), you will have a consistent (not-corrupted)
database.

???

Cheers,
Tim Ellis
DBA, Gamet

ps -- I'm trying to find a writeup that kind'f compares Postgres to
Sybase/Oracle/DB2/Informix and/or MySQL to all the aforementioned
databases, but can find nothing substantial. I'm interested in
ADMINISTRATIVE viewpoints. Most of the info on the web concentrates on
what DEVELOPERS think of the aforementioned offerings, but nothing about
what ADMINISTRATORS think. For a good example of a bad discussion, see the
Slashdot thread on Postgres vs. MySQL.

Any hints for me?

FAQ 1.14 (at http://postgresql.org/docs/faq-english.html) is close to what
I'm talking about, but I'd love something far more in-depth and which
compares to specific RDBMSs (ie: isn't politically correct).

Re: FAQ Q

От
Tom Lane
Дата:
Tim Ellis <Tim.Ellis@gamet.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> You do not have to trust Postgres itself: in all cases we push the log
>> entries out to the OS before declaring a transaction committed.

> In all cases while fsync mode is enabled, not in all cases (including
> no-fsync) right?

All cases is what I said, and all cases is what I meant.  fsync mode
only controls whether there's an fsync() after the write().

> So no-fsync simply means that "commit" != "sync'd to disk"? How is this
> different than the big boys do it? From Sybase internals classes and
> Oracle documentation, I know that when you say "COMMIT" you don't get any
> response back until the RDBMS has written that transaction TO DISK.

Or thinks it has.  One of the points that people don't much like to talk
about is the difficulty of knowing how far the data has really been
pushed.  In fsync mode we report commit when the kernel has told us
it's written the data; we have no way of knowing whether the kernel
lied, and even less way of knowing whether the disk drive has actually
written the information or only cached it on the controller board.
On modern disk drives the truth is likely to be that the bits have
only gone as far as the drive controller; so you may lose data if
you lose power and the power supply does not have enough capacity
to hold up the drive while it's finishing its pending writes.

However, the big boys running on the same hardware are going to have the
same problem.

If you can figure out how to configure your drives not to report write
complete until it's really complete, then you can feel secure with
either Postgres or the big boys.  Or you can buy a UPS and make sure you
are configured to shut down before the UPS runs out of steam.

In no-fsync mode we report commit after pushing the data out to the
kernel, but we don't try to force the kernel to push it out to disk.
So you are safe against a Postgres crash, but not against kernel or
hardware failure.  Your transaction could be lost if the system dies
before the kernel gets around to syncing it out to disk (typically
30 sec max, on most Unixen).

> Is the FAQ being too generous in speed and too unfair in recoverability to
> the commercial offerings here?

I didn't write the FAQ and won't take responsibility for its claims
about commercial databases.  I really don't know what tradeoffs they
offer in this area.

> Sybase at least just tries to avoid writing tranlogs to disk except in
> batches, so when you say COMMIT, it pauses until X amount of time passes
> or Y number of committed transactions have piled up, then it writes those
> trans to disk. But your client is blocked on the COMMIT until the
> transaction actually makes it to disk.

We have that too, although it's not on by default, and probably needs
further tuning work.

> In 7.1/7.2, when you COMMIT, you are guaranteed that WAL (and probably
> nothing but WAL) is written to disk before you get a return and that
> therefore if your instance immediately thereafter crashes, that after
> recovery (which is required), you will have a consistent (not-corrupted)
> database.

Modulo the above issues, yes.  Note also that WAL fsync protects you
against data corruption in the case of a mid-transaction system-level
failure: if all the WAL updates have made it to disk, then we will be
able to fix any incomplete or missing writes in the data files.  Without
fsync there is a distinct risk of corrupted data.  This has to do with
forcing fsync on the WAL files before we start to modify data pages
intra-transaction.

            regards, tom lane

Re: FAQ Q

От
Bruce Momjian
Дата:
Tom Lane wrote:
> However, the big boys running on the same hardware are going to have the
> same problem.
>
> If you can figure out how to configure your drives not to report write
> complete until it's really complete, then you can feel secure with
> either Postgres or the big boys.  Or you can buy a UPS and make sure you
> are configured to shut down before the UPS runs out of steam.
>
> In no-fsync mode we report commit after pushing the data out to the
> kernel, but we don't try to force the kernel to push it out to disk.
> So you are safe against a Postgres crash, but not against kernel or
> hardware failure.  Your transaction could be lost if the system dies
> before the kernel gets around to syncing it out to disk (typically
> 30 sec max, on most Unixen).
>
> > Is the FAQ being too generous in speed and too unfair in recoverability to
> > the commercial offerings here?
>
> I didn't write the FAQ and won't take responsibility for its claims
> about commercial databases.  I really don't know what tradeoffs they
> offer in this area.

I have updated the FAQ to remove the reference to fsync, because with WAL it
isn't as much of an issue:

Performance

PostgreSQL has performance similar to other commercial and open source
databases. it is faster for some things, slower for others. In
comparison to MySQL or leaner database systems, we are slower on
inserts/updates because of transaction overhead. Of course, MySQL does
not have any of the features mentioned in the Features section above. We
are built for reliability and features, though we continue to improve
performance in every release. There is an interesting Web page comparing
PostgreSQL to MySQL at http://openacs.org/why-not-mysql.html

>
> > Sybase at least just tries to avoid writing tranlogs to disk except in
> > batches, so when you say COMMIT, it pauses until X amount of time passes
> > or Y number of committed transactions have piled up, then it writes those
> > trans to disk. But your client is blocked on the COMMIT until the
> > transaction actually makes it to disk.

Informix has two modes, buffered logging, and unbuffered logging.  The
PostgreSQL setup is unbuffered logging (from
http://nasis.nrcs.usda.gov/archive/logbuf.html#LoggingStatus):

Unbuffered Logging

An unbuffered logging status means that the database writes each
transactions result to the logical log file as soon as that transaction
completes. This mode increases the amount of data that might be
recovered during the fast recovery phase. All the transactions that make
it to disk can be recovered. It is slower because it requires more
frequent writes to the database and it takes more space because all
changes are logged.

Buffered Logging

Buffered logging does not actually reduce the amount of data that is
written to the logical log it just reduces how often changes are written
to disk. With buffered logging the Informix OnLine server waits until an
internal memory buffer is nearly filled before writing the data to disk.
This means that even though buffered logging will eventually write the
same data as unbuffered logging it will take fewer disk access. That
means that the logging should take less time. Buffered logging is not
quite as safe as unbuffered logging. As we mentioned only the
transactions on the disk will be recovered during fast recovery. Even if
a transaction has been completed it may not be on the disk with buffered
logging.


> We have that too, although it's not on by default, and probably needs
> further tuning work.

Not sure we do.  We have something that delays the transaction hoping
for another one to come along, but we don't have something that writes
out WAL after X seconds or X transactions.  I think we need something
like that.

In fact, if we had it, I think we could remove the fsync=off option
entirely.  With fsync=off, an OS crash means you have to restore from
disk.  With WAL writes/fsync every 5 seconds, at least an OS crash
brings the system back to a stable state, and the performance with
fsync=off would be nearly identical.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026