Journal based VACUUM FULL

From: Ryan David Sheasby
Subject: Journal based VACUUM FULL
Date:
Msg-id: CANTTaev-LdgYj4uZoy67catS5SF5u_X-dTHiLH7OKwU6Gv3MFA@mail.gmail.com
Responses: Re: Journal based VACUUM FULL (Andreas Karlsson <andreas@proxel.se>)
List: pgsql-hackers
Hi Team.

I'm new to contributing, so hopefully this is the right place. I've searched the forum, and it seems this is the place for feature requests and suggestions.

I was reading up on VACUUM and VACUUM FULL and saw that the current implementation of VACUUM FULL writes to an entirely new file and then switches to it, as opposed to rewriting the current file in place. I assume the reason for this is safety: if the database shuts down in the middle of the vacuum, the most you lose is the vacuum's progress and you have to start from scratch, but otherwise the database retains its integrity and doesn't become corrupted. This seems to be an effective but somewhat rudimentary system that could be improved. Most notably, rewriting almost the entire database takes roughly double the storage for the duration of the rewrite, which can become expensive or simply inconvenient in IaaS (and probably other) installations where drive sizes are scaled on demand. Even if VACUUM FULL doesn't need to run often, having to reallocate drive space for an effective duplication is not ideal. My suggestion is a journal-based in-place rewrite of the database files.

This means that VACUUM FULL would do a "pre-processing" pass over the database and figure out, at a fairly low level, what operations need to be done to compact the database back to its "correct" size. These operations would be written in their entirety to a journal which records all the operations about to be performed, with some mechanism for checking whether they have already been performed, using the same principle described here: https://en.wikipedia.org/wiki/Journaling_file_system. This would allow an in-place rewrite of the file in a safe manner, such that you can recover from an unexpected shutdown by resuming the VACUUM FULL from the journal, or by detecting where the copy hole is in the file and recording/ignoring it.
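
As a rough sketch of that pre-processing pass (all names and structures below are mine, purely for illustration, not anything that exists in PostgreSQL), the planner could walk the file's live extents in ascending order and emit one copy operation per extent that isn't already in its compacted position:

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t off, len; } Extent;        /* a live byte range */
typedef struct { uint64_t src, dst, len; } CopyOp;   /* one planned copy  */

/* Slide every live extent toward the start of the file.  Returns the
 * number of ops written to 'out'; once they have all been applied, the
 * file can be truncated to '*new_size'. */
static size_t plan_compaction(const Extent *live, size_t n,
                              CopyOp *out, uint64_t *new_size)
{
    uint64_t write_pos = 0;      /* next free offset after compaction */
    size_t nops = 0;

    for (size_t i = 0; i < n; i++) {
        if (live[i].off != write_pos)   /* skip extents already in place */
            out[nops++] = (CopyOp){ live[i].off, write_pos, live[i].len };
        write_pos += live[i].len;
    }
    *new_size = write_pos;
    return nops;
}

A real planner would additionally split each op so that its length never exceeds src - dst, per the overwrite constraint described in the next paragraph.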

The journal could be something as simple as a record of which byte ranges need to be copied to which destinations. The journal should record whenever a byte-range copy completes, for the sake of error recovery. Obviously, each entry's byte range would be capped at the distance between its source and destination, so that the destination can never overwrite the source and make recovery impossible (how would you know where in the copy you stopped?). This has a snowball effect, though: the further you are into the rewrite, the further apart the source and destination ranges are, so you can copy bigger chunks at a time and update the journal's completion flags less often. In the case of a shutdown during a copy, you merely read the journal, find the first copy that isn't marked complete, and continue rewriting from there. Even if some of the bytes have already been copied, there is no corruption, because the source bytes haven't been overwritten at all. Finally, a simple file truncate can take place once all the copies are complete, and the journal can be deleted.

This means the headroom required in the filesystem would be much smaller, and would pay for itself on any copy of 17 bytes or more (assuming two 8-byte pointers plus a bit for the completion flag). The only situation in which this system would consume more space than a total copy is if the database is more than 50% garbage, the garbage is perfectly spread out (i.e. isn't in large chunks that can each be covered by a single journal entry), and each piece of garbage is smaller than 17 bytes. Obviously, the journal itself would need an error-checking mechanism to ensure it was correctly and completely written, but this could be as simple as a whole-file hash at the end.
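
To make the overwrite constraint concrete, here is a hedged sketch of one journal entry and how it might be applied. This is my own illustration, not existing PostgreSQL code; I've used plain pread/pwrite on the relation file's descriptor, and left journal flushing to the caller:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_CHUNK (1u << 20)   /* 1 MiB copy buffer; the size is arbitrary */

/* Hypothetical on-disk journal record. */
typedef struct {
    uint64_t src;    /* source offset                                    */
    uint64_t dst;    /* destination offset, dst < src                    */
    uint64_t len;    /* bytes to copy; invariant: len <= src - dst       */
    uint8_t  done;   /* completion flag, persisted once the copy is done */
} JournalEntry;

/* Because len <= src - dst, the destination range [dst, dst+len) ends at
 * or before src, so a copy can never clobber its own source bytes, and
 * replaying a half-finished entry after a crash is always safe. */
static int apply_entry(int fd, JournalEntry *e)
{
    static uint8_t buf[IO_CHUNK];
    uint64_t copied = 0;

    while (copied < e->len) {
        size_t n = (e->len - copied > IO_CHUNK)
                       ? IO_CHUNK : (size_t)(e->len - copied);
        if (pread(fd, buf, n, (off_t)(e->src + copied)) != (ssize_t)n)
            return -1;
        if (pwrite(fd, buf, n, (off_t)(e->dst + copied)) != (ssize_t)n)
            return -1;
        copied += n;
    }
    e->done = 1;   /* caller rewrites this flag in the journal and fsyncs */
    return 0;
}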

An alternative to the completion flags is to compute a hash of the data to be copied and store it in the journal; in recovery, you then compare the destination bytes against the hash. This has the advantage of not needing to write to the journal to keep it up to date during the operations, but the disadvantages of having to compute many hashes while recovering, and of storing the hashes in the journal, which takes more space. It's also arguably less safe, as there is always the chance (albeit extremely unlikely) of a collision, which would mean the data is not actually validated. I would argue the risk of this is lower than the risk of bit rot flipping the completion bit, however.
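
A sketch of that recovery check, again purely illustrative. FNV-1a is used here only as a stand-in for whatever hash the journal would really store (a CRC or a cryptographic hash would be more appropriate in practice):

#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* FNV-1a: illustrative stand-in for the journal's real hash function. */
static uint64_t fnv1a(const uint8_t *p, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* During recovery, an entry counts as complete iff the bytes now at the
 * destination hash to the value recorded when the journal was written.
 * This avoids journal writes during the copy phase, at the cost of
 * hashing every destination range on recovery. */
static int entry_completed(int fd, uint64_t dst, uint64_t len,
                           uint64_t want, uint8_t *buf /* >= len bytes */)
{
    if (pread(fd, buf, len, (off_t)dst) != (ssize_t)len)
        return 0;                    /* unreadable: treat as incomplete */
    return fnv1a(buf, len) == want;
}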

A journaling system like this *might* have performance benefits too, specifically on less intelligent file systems like NTFS, which can become easily fragmented (causing potentially big performance issues on spinning disks). Rewriting the same file will never require a file-system defragment. The other advantage, as mentioned before, is with auto-scaling drives used as storage for the DB (as they often are in IaaS/PaaS services): not having to scale up rapidly could be a performance boost in some cases.

Finally, a journaling system like this also lends itself to stopping and resuming in the middle of a VACUUM FULL. Once the journal is created and the rewrites have started, assuming the journal's completion flags are kept up to date, you can stop the operation in the middle (presumably filling the current "gap" with null bytes, or otherwise indicating to the DB that there's a gap in the middle that should be ignored) and continue using the database as usual. This means you can do a "soft" recovery wherein the database is only halfway vacuumed but still perfectly operational and functional. You can also resume from this soft recovery by simply continuing from the last copy that completed. Obviously you only regain disk space when you reach the end and truncate the file, but you can at least pause/resume the operation, waiting only for the current copy block to finish instead of for the entire VACUUM FULL.
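
Under the same illustrative structures as before (JournalEntry and apply_entry from the earlier sketch), pause recovery, crash recovery, and resume would all collapse into one routine: scan for the first entry whose flag is unset, apply everything from there, and truncate at the end. Again, only a sketch of the idea:

#include <stddef.h>
#include <unistd.h>

/* Skip every completed copy, apply the rest, then reclaim the space.
 * Re-applying a partially finished entry is harmless, as argued above. */
static int resume_vacuum(int fd, JournalEntry *journal, size_t n,
                         uint64_t new_size)
{
    size_t i = 0;
    while (i < n && journal[i].done)
        i++;
    for (; i < n; i++)
        if (apply_entry(fd, &journal[i]) != 0)
            return -1;
    return ftruncate(fd, (off_t)new_size);
}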

I hope this was a coherent explanation of my suggestion. It's possible and maybe even likely that there's a glaring misunderstanding or assumption on my part that means this isn't practical, but I'd still love to get feedback on it.

Thanks
Ryan Sheasby
