Re: postgres large database backup

From: Michael Loftis
Subject: Re: postgres large database backup
Date:
Msg-id: CAHDg04sGAfTjsDO1Gt6T4Eq5K5X_Emtma+4iZUBMCBF5bWQJWA@mail.gmail.com
In reply to: Re: postgres large database backup  (Mladen Gogala <gogala.mladen@gmail.com>)
Responses: Re: postgres large database backup  (Michael Loftis <mloftis@wgops.com>)
List: pgsql-general


On Thu, Dec 1, 2022 at 06:40 Mladen Gogala <gogala.mladen@gmail.com> wrote:
On 11/30/22 20:41, Michael Loftis wrote:

ZFS snapshots don’t typically have much if any performance impact versus not having a snapshot (and already being on ZFS), because it’s already doing CoW-style semantics.

Hi Michael,

I am not sure that such a statement holds water. When a snapshot is taken, the number of necessary I/O requests goes up dramatically. For every block the snapshot points to, it is necessary to read the block, write it to a spare location, and only then overwrite it, if you want to write to a block the snapshot points to. That gives 3 I/O requests for every block written. NetApp tries to optimize this by using 64MB blocks, but ZFS on Linux cannot do that; it has to use standard CoW because it doesn't have the benefit of its own hardware and OS. And standard CoW triples the number of I/O requests for every write to blocks pointed to by a snapshot, for every snapshot. CoW is a very expensive animal, with horns.


Nope, ZFS does not behave that way, although AFAIK all other snapshotting filesystems and volume managers do. One major architectural decision of ZFS is the atomicity of writes: data at rest stays at rest, so it does NOT overwrite live data. Snapshots do not change the write path or behavior in ZFS. Writes are atomic; you're always writing new data to free space and then accounting for where the current record/volume block within a file or volume actually lives on disk. If a filesystem, volume manager, or RAID system overwrites data, and in the middle of that process has an issue that breaks the write, and that data is also live data, it can't be atomic; it has now destroyed data (the RAID write hole is one expression of this). That's why adding a snapshot isn't an additional cost for ZFS: for better or worse you're paying that cost already, because it never overwrites live data anyway. If there's no snapshot, then once the write is committed (the TXG is committed) and the refcount on the old blocks drops to zero, those old blocks go back to the free pool to be used again.

There's a bunch of optimization to how that actually happens, but at the end of the day your writes do not overwrite your data in ZFS. Writes of data get directed at free space, and eventually the on-disk structures get an atomic update that says the data now lives here. In the time between those two events the ZIL (which may live on its own special devices called the SLOG; this is why you often see the terms ZIL/journal/SLOG/log vdev used interchangeably) is the durable bit, but it's never read in normal operation, only during recovery.

This is also where the ZFS filesystem property recordsize, or volblocksize (independently configurable on every filesystem/volume within a pool), matters for performance. If you clobber a whole record, ZFS isn't going to read anything extra when it gets around to committing; it knows the whole record changed and can safely write a whole new record. (Every 5s it does this TXG commit, so two 64k writes are still slower with a 128k recordsize, but they still shouldn't pull in that 128k record.) There are other optimizations, but as long as the chosen recordsize/volblocksize matches your writes, and your writes are aligned to it within the file or volume, you won't see an extra read of the data as part of the normal flow of committing it. Snapshots don't change that.
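If it helps to see the idea concretely, here's a toy sketch in plain Python (made-up names, nothing resembling real ZFS internals) of a write path that never overwrites live data. Note that nothing in the write routine consults snapshot state, and a write that covers a whole record never reads the old block:

class ToyCowDataset:
    """Toy model only: every write of a record goes to a freshly
    allocated block and the live pointer is swapped afterwards."""

    def __init__(self, recordsize=128 * 1024):
        self.recordsize = recordsize
        self.blocks = {}       # block_id -> bytes  (simulated disk)
        self.records = {}      # record_no -> block_id (live pointer)
        self.next_block = 0

    def _alloc(self, data):
        # Always a brand-new block; an existing block is never overwritten.
        bid = self.next_block
        self.next_block += 1
        self.blocks[bid] = data
        return bid

    def write_record(self, record_no, offset, data):
        old_bid = self.records.get(record_no)
        if old_bid is None or (offset == 0 and len(data) == self.recordsize):
            # First write, or a write covering the whole record:
            # no read of the old block is needed.
            new_data = data
        else:
            # Partial overwrite: read-modify-write of the record,
            # still landing in a new block.
            buf = bytearray(self.blocks[old_bid])
            buf[offset:offset + len(data)] = data
            new_data = bytes(buf)
        self.records[record_no] = self._alloc(new_data)
        # old_bid only returns to the free pool later, and only if no
        # snapshot still references it; nothing above looked at snapshots.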

Because of those architectural decisions, CoW behavior is part of ZFS' existing performance penalty. So when you look at that older Oracle ASM vs ZFS article, remember that that extra... what was it, 0.5ms?... accounts for most, probably all, of the penalty of a snapshot too, if you want (or need) one. It's fundamental to how ZFS works and provides data durability and atomicity. This is why ZFS calls its snapshots essentially free: you're already paying the performance cost for them. What would ASM do if it had a snapshot to manage? Or a few dozen on the same data? Obviously you'd see it during the first writes to those snapshotted areas. Would there be ongoing performance penalties with those snapshots? Maybe ASM has an optimization that saves that benchmark a bunch of time when there is no snapshot, but once one exists it takes a different write path and adds a performance penalty. What if a snapshot were taken in the middle of the benchmark? Yeah, there will be some extra IOPS when ZFS takes the snapshot, to record that "a snapshot now exists", but that doesn't dramatically change its underlying write path after that point.

That atomicity and data durability also mean that even if you lose the SLOG devices (which hold the ZIL/journal; if you don't have a SLOG/log vdev it lives in-pool) you do not lose all the data, only whatever somehow remained uncommitted after the ZIL failed. Say you had some sort of hard fault/crash and the SLOG/ZIL devices were destroyed: you can still opt to mount the ZFS pool, and its filesystems/volumes, without that ZIL. That could (well, would) still suck, but it's better than losing everything. If the ZIL fails while the system is live, ZFS will do its best to get everything committed ASAP as soon as it knows something is wrong, and keep it that way, so on a SLOG/ZIL failure your performance WILL suffer (and boy is it UGLY, but at least the pool isn't dead and destroyed). And because of the atomicity property, even if there are further failures during that window of time where it scrambles to commit, ZFS does not wreck the filesystem. If the devices are still available it will still hand back whatever data it can.

So there's a very different approach to what's important with ZFS. It's not that performance isn't important; it's that your data is more important. Performance is NOT ignored, but to get that atomicity and durability you ARE paying some performance cost. Is that worth it for YOUR database or files? Only you as the admin can decide. No, ZFS is NOT a great choice for every database or dataset; for some workloads that penalty is not going to be acceptable.

So writes in ZFS always go to the journal (ZIL) first, barring config property tweaks. Once a journal entry is durable, the data is considered written, but uncommitted. If we crash at that point, journal recovery brings the pool back to that last written journal entry, so ZFS is never lying to the application or the OS.
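As a rough sketch of that flow (again toy Python with made-up names, not the real ZIL on-disk format; the 5-second interval mirrors the TXG commit cadence mentioned above):

import time

class ToyIntentLog:
    def __init__(self):
        self.entries = []              # the durable journal (ZIL/SLOG)

    def append(self, entry):
        # Once this is durable the sync write can be acknowledged:
        # "written, but uncommitted".
        self.entries.append(entry)

class ToyPool:
    TXG_INTERVAL = 5.0                 # seconds between TXG commits

    def __init__(self):
        self.zil = ToyIntentLog()
        self.committed = {}            # the main on-disk structures
        self.pending = {}              # dirty data waiting for a TXG
        self.last_commit = time.monotonic()

    def sync_write(self, key, value):
        self.zil.append((key, value))  # durable first, then ack the caller
        self.pending[key] = value
        self.maybe_commit()

    def maybe_commit(self):
        if time.monotonic() - self.last_commit >= self.TXG_INTERVAL:
            self.committed.update(self.pending)   # atomic TXG commit
            self.pending.clear()
            self.zil.entries.clear()   # journal entries no longer needed
            self.last_commit = time.monotonic()

    def recover(self):
        # The only time the journal is actually read back: after a crash,
        # replay anything that never made it into a TXG commit.
        for key, value in self.zil.entries:
            self.committed[key] = value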

The data is always written back in an atomic manner, writing new or changed data to a free block, never to the existing block, so if it bombs in the middle of THAT you're fine. When a whole (recordsize or volblocksize) block of sectors is written at once, or within a window of time, there's no read of the original data once ZFS finishes coalescing and committing the write. There is of course always the free-space/reference-count metadata, but if it hasn't changed it isn't written, and it's needed anyway to find where the data lives, so it's already present. So yes, on a first write there may be an extra write in the area where that metadata is kept (which may itself be coalesced with other writes headed there), but it's not as if every write incurs that overhead to maintain the old block, and if your write replaces that whole recordsize-sized block the old data never needs to be read at all. After that, until a snapshot is added or removed, there are no further changes to the references to that old block. When there are no more references to a given block it goes back into the free pool.

Having a snapshot doesn't add more work. It's a byproduct of the atomic write behavior ZFS already implements (always write to free blocks, never overwrite a live block). You're just asking ZFS not to free those blocks.
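In toy form (once more, made-up Python, not ZFS internals), a snapshot is just another holder of references to the blocks that were live at that moment, so later writes do exactly the same work whether or not it exists:

class ToyBlockStore:
    def __init__(self):
        self.refcount = {}             # block_id -> number of holders
        self.free_pool = set()

    def retain(self, block_id):
        self.refcount[block_id] = self.refcount.get(block_id, 0) + 1

    def release(self, block_id):
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            self.free_pool.add(block_id)   # nothing references it anymore

store = ToyBlockStore()
live = {"record-0": 17}                # record -> current block id
store.retain(17)

snapshot = dict(live)                  # "snapshot": keep the old references
for bid in snapshot.values():
    store.retain(bid)

# A later write replaces block 17 with new block 42: the same work as
# without a snapshot; the only difference is that 17 stays referenced
# instead of going back to the free pool.
live["record-0"] = 42
store.retain(42)
store.release(17)
assert 17 not in store.free_pool       # pinned by the snapshot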


Regards

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
