Discussion: Reduce/eliminate the impact of FPW

Reduce/eliminate the impact of FPW

From: Daniel Wood
Date:
I thought that the biggest reason for the pgbench RW slowdown during a checkpoint was the flood of dirty page writes increasing COMMIT latency.  It turns out that the documentation which states that FPWs start "after a checkpoint" really means after a CKPT starts.  And this is the real cause of the deep dip in performance.  Maybe only I was fooled... :-)

If we can't eliminate FPWs, can we at least reduce their impact?  Instead of writing the before images of pages inline into the WAL, which increases COMMIT latency, write these same images to a separate physical log file.  The key idea is that I don't believe COMMITs require these buffers to be immediately flushed to the physical log.  We only need to flush them before the dirty pages themselves are written.  This delay allows the before-image IOs to be decoupled and done efficiently, without impacting COMMITs.

  1. When we generate a physical image add it to an in memory buffer of before page images.
  2. Put the physical log offset of the before image into the WAL record.  This is the current physical log file size plus the offset in the in-memory buffer of pages.
  3. Set a bit in the bufhdr indicating this was done.
  4. COMMIT's do not need to worry about those buffers.
  5. Periodically flush the in-memory buffer and clear the bit in the BufHdr.
  6. During any dirty page flushing if we see the bit set, which should be rare, then make sure we get our before image flushed.  This would be similar to our LSN based XLogFlush().

Do we need these before images for more than one CKPT?  I don't think so.  Do PITRs require before images, given that a PITR is a continuous roll-forward from a restore?  Just some considerations.
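
To make the bookkeeping concrete, here is a minimal sketch of steps 1-6.  Every name in it (phys_log_insert, BM_PLOG_PENDING, and so on) is hypothetical rather than actual PostgreSQL code, and all locking is omitted:

/* Hypothetical sketch of the proposed physical-log bookkeeping. */
#include <stdint.h>
#include <string.h>

#define BLCKSZ          8192            /* PostgreSQL page size */
#define BM_PLOG_PENDING (1U << 30)      /* step 3: before image not yet durable */

typedef struct BufferDesc { uint32_t state; } BufferDesc;   /* stand-in */

static struct {
    uint64_t file_size;             /* bytes already durable in the log file */
    char     buf[1024 * BLCKSZ];    /* in-memory before-image buffer */
    uint32_t buf_used;              /* bytes accumulated but not yet flushed */
} plog;

/* Steps 1-3: stash the before image; return where it will land on disk. */
uint64_t
phys_log_insert(BufferDesc *bufhdr, const char *page)
{
    uint64_t offset = plog.file_size + plog.buf_used;    /* step 2 */

    memcpy(plog.buf + plog.buf_used, page, BLCKSZ);      /* step 1 */
    plog.buf_used += BLCKSZ;
    bufhdr->state |= BM_PLOG_PENDING;                    /* step 3 */
    return offset;   /* goes into the WAL record in place of the full image */
}

/* Step 5: periodic background flush; COMMIT never waits on this (step 4). */
void
phys_log_flush(void)
{
    /* write() and fsync() plog.buf to the physical log file here ... */
    plog.file_size += plog.buf_used;
    plog.buf_used = 0;
    /* ... then clear BM_PLOG_PENDING on the buffers this flush covered. */
}

/* Step 6: before writing a dirty page out, make its before image durable,
 * analogous to the LSN-based XLogFlush() done for WAL. */
void
flush_dirty_buffer(BufferDesc *bufhdr, const char *page)
{
    if (bufhdr->state & BM_PLOG_PENDING)    /* rare: flusher hasn't run yet */
        phys_log_flush();
    /* ... now write the dirty page itself. */
    (void) page;
}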

Do I need to back this physical log up?  I likely(?) need to deal with replication.

Turning off FPW gives about a 20% (maybe more) boost on a pgbench TPC-B RW workload that fits in the buffer cache.  Can I get this 20% improvement with a separate physical log of before-page images?

Doing IOs off on the side, decoupled from the WAL stream, doesn't seem to impact COMMIT latency on modern SSD based storage systems.  For instance, you can hammer a shared data-and-WAL SSD filesystem with dirty page writes from the CKPT, at near the max IOPS of the SSD, and not impact COMMIT latency.  However, this presumes that the CKPT's natural spreading of dirty page writes across the CKPT target doesn't push too many outstanding IOs into the storage write Q on the OS/device.

NOTE: I don't believe the CKPT's throttling is perfect.  A burst of dirty pages into the cache just before a CKPT might flood the Q, which would further slow TPS during the CKPT.  But a fix for this is off topic for the FPW issue.

Thanks to Andres Freund both for making me aware of the Q-depth impact on COMMIT latency and for the hint that FPW might also be causing the CKPT slowdown.  FYI, I always knew about the FPW slowdown in general; I just didn't realize it was THE primary cause of the CKPT TPS slowdown on pgbench.  NOTE: I realize that spinning media might exhibit different behavior.  And I did not say dirty page writing has NO impact on good SSDs.  It depends, and this is a subject for a later date, as I have a theory as to why I sometimes see sawtooth performance for pgbench TPC-B and sometimes a square wave, but I want to prove it first.

Re: Reduce/eliminate the impact of FPW

From: Robert Haas
Date:
On Mon, Aug 3, 2020 at 5:26 AM Daniel Wood <hexexpert@comcast.net> wrote:
> If we can't eliminate FPWs, can we at least reduce their impact?  Instead of writing the before images of pages
> inline into the WAL, which increases COMMIT latency, write these same images to a separate physical log file.  The
> key idea is that I don't believe COMMITs require these buffers to be immediately flushed to the physical log.  We
> only need to flush them before the dirty pages themselves are written.  This delay allows the before-image IOs to be
> decoupled and done efficiently, without impacting COMMITs.

I think this is what's called a double-write buffer, or what was tried
some years ago under that name.  A significant problem is that you
have to fsync() the double-write buffer before you can write the WAL.
So instead of this:

- write WAL to OS
- fsync WAL

You have to do this:

- write double-write buffer to OS
- fsync double-write buffer
- write WAL to OS
- fsync WAL

Note that you cannot overlap these steps -- the first fsync must be
completed before the second write can begin, else you might try to
replay WAL for which the double-write buffer information is not
available.
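
In code form, the constraint looks something like this.  The helpers and file descriptors are stand-ins, not real PostgreSQL routines; the point is only that step 2 must complete before step 3 begins:

/* Sketch of the commit-time ordering described above; helpers are stand-ins. */
#include <stddef.h>
#include <unistd.h>

extern int dwb_fd;      /* double-write buffer file */
extern int wal_fd;      /* WAL segment file */
extern void write_all(int fd, const void *buf, size_t len);  /* full write() loop */

void
commit_flush(const void *dwb, size_t dwb_len, const void *wal, size_t wal_len)
{
    write_all(dwb_fd, dwb, dwb_len);   /* 1. write double-write buffer to OS */
    if (fsync(dwb_fd) != 0)            /* 2. fsync double-write buffer */
        _exit(1);                      /*    (a PANIC in real code) */
    /* Only now may the WAL become durable: otherwise a crash could leave
     * WAL that references before images which never reached disk. */
    write_all(wal_fd, wal, wal_len);   /* 3. write WAL to OS */
    if (fsync(wal_fd) != 0)            /* 4. fsync WAL */
        _exit(1);
}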

Because of this, I think this is actually quite expensive. COMMIT
requires the WAL to be flushed, unless you configure
synchronous_commit=off. So this would double the number of fsyncs we
have to do. It's not as bad as all that, because the individual fsyncs
would be smaller, and that makes a significant difference. For a big
transaction that writes a lot of WAL, you'd probably not notice much
difference; instead of writing 1000 pages to WAL, you might write 730
pages to the WAL and 270 to the double-write buffer,
or something like that. But for short transactions, such as those
performed by pgbench, you'd probably end up with a lot of cases where
you had to write 3 pages instead of 2, and not only that, but the
writes have to be consecutive rather than simultaneous, and to
different parts of the disk rather than sequential. That would likely
suck a lot.

It's entirely possible that these kinds of problems could be mitigated
through really good engineering, maybe to the point where this kind of
solution outperforms what we have now for some or even all workloads,
but it seems equally possible that it's just always a loser. I don't
really know. It seems like a very difficult project.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Reduce/eliminate the impact of FPW

From: Daniel Wood
Date:
> On 08/03/2020 8:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
...
> I think this is what's called a double-write buffer, or what was tried
> some years ago under that name.  A significant problem is that you
> have to fsync() the double-write buffer before you can write the WAL.

I don't think it needs to be fsync'ed before the WAL.  If the
log record has a FPW reference beyond the physical log EOF then we
don't need to restore the before image, because we haven't yet done
the dirty page write from the cache.  The before image only needs
to be flushed before the dirty page write.  Usually this will
already have been done.
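
As a sketch of the recovery-side test being argued for here (names hypothetical; torn writes within the physical log itself are glossed over):

/* Sketch: at crash recovery, decide whether a WAL record's before image
 * must be restored.  If its physical-log offset is at or past the durable
 * end of the physical log, the before image was never flushed, and by the
 * flush-ordering rule the dirty page was therefore never written either,
 * so the on-disk page is still the old, intact version. */
#include <stdbool.h>
#include <stdint.h>

extern uint64_t phys_log_durable_eof;   /* physical log size found at startup */

bool
need_before_image_restore(uint64_t fpw_offset)
{
    return fpw_offset < phys_log_durable_eof;
}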

> ... But for short transactions, such as those
> performed by pgbench, you'd probably end up with a lot of cases where
> you had to write 3 pages instead of 2, and not only that, but the
> writes have to be consecutive rather than simultaneous, and to
> different parts of the disk rather than sequential. That would likely
> suck a lot.

Wherever you write the before images, whether in the WAL or into a
separate file, you would write the same number of pages.  I don't
understand the 3 pages vs 2 pages comment.

And, "different parts of the disk"???  I wouldn't enable the feature
on spinning media unless I had a dedicated disk for it.

NOTE:
In the '90s, Informix called this the physical log.  Restoring it at
crash time restored physical consistency, after which redo/undo
recovery achieved logical consistency.  From their docs:
    "If the before-image of a modified page is stored in the physical-log buffer, it is eventually flushed from the
    physical-log buffer to the physical log on disk. The before-image of the page plays a critical role in restoring data
    and fast recovery. For more details, see Physical-Log Buffer."
 

> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Reduce/eliminate the impact of FPW

From: SATYANARAYANA NARLAPURAM
Date:
Increasing checkpoint_timeout reduces the amount of WAL written to disk.  This has several benefits: fewer WAL IOs, less archival load on the system, and less network traffic to the standby replicas.  However, it also increases crash recovery time and therefore impacts server availability.  Investing in parallel recovery for Postgres would reduce crash recovery time and let us raise the checkpoint frequency to a much higher value.  This idea is orthogonal to the double-write improvements mentioned in this thread.  Thomas Munro has a patch for page prefetching during recovery, which speeds up recovery when the working set doesn't fit in memory; we also need parallel recovery to replay huge amounts of WAL when the working set is in memory.

Thanks,
Satya



Re: Reduce/eliminate the impact of FPW

From: Stephen Frost
Date:
Greetings,

Please don't top-post on these lists.

* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> Increasing checkpoint_timeout reduces the amount of WAL written to disk.
> This has several benefits: fewer WAL IOs, less archival load on the
> system, and less network traffic to the standby replicas.  However, it
> also increases crash recovery time and therefore impacts server
> availability.

Sure.

> Investing in parallel recovery for Postgres would reduce crash recovery
> time and let us raise the checkpoint frequency to a much higher value.

Parallel recovery is a nice idea but it's pretty far from trivial...  Did
you have thoughts about how that would be accomplished?

> This idea is orthogonal to the double-write improvements mentioned in
> this thread.  Thomas Munro has a patch for page prefetching during
> recovery, which speeds up recovery when the working set doesn't fit in
> memory; we also need parallel recovery to replay huge amounts of WAL
> when the working set is in memory.

What OS, filesystem, etc, are you running where you're seeing that the
WAL pre-fetch is helping to speed up recovery?  Based on prior
discussion, that seemed to help primarily on ZFS due to the block size
being larger than our block size, which, while somewhat interesting,
isn't as exciting as finding a way to speed up recovery across the
board.

Thanks,

Stephen
