Re: POC: Cleaning up orphaned files using undo logs

Поиск
Список
Период
Сортировка
От Amit Kapila
Тема Re: POC: Cleaning up orphaned files using undo logs
Дата
Msg-id CAA4eK1J6KzJ50w8w4XAVdFg4=aNNK3vff1wM0CW5h55NNNqRgg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: POC: Cleaning up orphaned files using undo logs  (Heikki Linnakangas <hlinnaka@iki.fi>)
Ответы Re: POC: Cleaning up orphaned files using undo logs  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-hackers
On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I had a look at the UNDO patches at
> https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com,
> and at the patch to use the UNDO logs to clean up orphaned files, from
> undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to
> review?
>

Yes, I am not sure of cleanup orphan file patch (Thomas can confirm
the same), but others are latest.

> Thanks Thomas and Amit and others for working on this! Orphaned relfiles
> has been an ugly wart forever. It's a small thing, but really nice to
> fix that finally. This has been a long thread, and I haven't read it
> all, so please forgive me if I repeat stuff that's already been discussed.
>
> There are similar issues in CREATE/DROP DATABASE code. If you crash in
> the middle of CREATE DATABASE, you can be left with orphaned files in
> the data directory, or if you crash in the middle of DROP DATABASE, the
> data might be gone already but the pg_database entry is still there. We
> should plug those holes too.
>

+1. Interesting.

> There's a lot of stuff in the patches that are not relevant for cleaning
> up orphaned files. I know this cleaning up orphaned files work is mainly
> a vehicle to get the UNDO log committed, so that's expected. If we only
> cared about orphaned files, I'm sure the patches wouldn't spend so much
> effort on concurrency, for example. Nevertheless, I think we should
> leave out some stuff that's clearly unused, for now. For example, a
> bunch of fields in the record format: uur_block, uur_offset, uur_tuple.
> You can add them later, as part of the patches that actually need them,
> but for now they just make the patch larger to review.
>
> Some more thoughts on the record format:
>
> I feel that the level of abstraction is not quite right. There are a
> bunch of fields, like uur_block, uur_offset, uur_tuple, that are
> probably useful for some UNDO resource managers (zheap I presume), but
> seem kind of arbitrary. How is uur_tuple different from uur_payload?
>

The uur_tuple field can only store tuple whereas uur_payload can have
miscellaneous information.  For ex. in zheap, we store transaction
information like CID, CTID, some information related to TPD, etc. in
the payload.  Basically, I think eventually payload will have some
bitmap to indicate what all is stored in it.  OTOH, I agree that if we
want we can store tuple as well in the payload.

> Should they be named more generically as uur_payload1 and uur_payload2?
> And why two, why not three or four different payloads? In the WAL record
> format, there's a concept of "block id", which allows you to store N
> number of different payloads in the record, I think that would be a
> better approach. Or only have one payload, and let the resource manager
> code divide it as it sees fit.
>

For payload, something like what you describe here sounds like a good
idea, but I feel we can have tuple as a separate field.  It will help
in accessing tuple quickly and easily during visibility or rollbacks
for some AM's like zheap.

> Many of the fields support a primitive type of compression, where a
> field can be omitted if it has the same value as on the first record on
> an UNDO page. That's handy. But again I don't like the fact that the
> fields have been hard-coded into the UNDO record format. I can see e.g.
> the relation oid to be useful for many AMs. But not all. And other AMs
> might well want to store and deduplicate other things, aside from the
> fields that are in the patch now. I'd like to move most of the fields to
> AM specific code, and somehow generalize the compression. One approach
> would be to let the AM store an arbitrary struct, and run it through a
> general-purpose compression algorithm, using the UNDO page's first
> record as the "dictionary". Or make the UNDO page's first record
> available in whole to the AM specific code, and let the AM do the
> deduplication. For cleaning up orphaned files, though, we don't really
> care about any of that, so I'd recommend just ripping it out for now.
> Compression/deduplication can be added later as a separate patch.
>

I think this will make the undorecord-interface patch a bit simpler as well.

> The orphaned-file cleanup patch doesn't actually use the uur_reloid
> field. It stores the RelFileNode instead, in the paylod. I think that's
> further evidence that the hard-coded fields in the record format are not
> quite right.
>
>
> I don't like the way UndoFetchRecord returns a palloc'd
> UnpackedUndoRecord. I would prefer something similar to the xlogreader
> API, where a new call to UndoFetchRecord invalidates the previous
> result. On efficiency grounds, to avoid the palloc, but also to be
> consistent with xlogreader.
>
> In the UNDO page header, there are a bunch of fields like
> pd_lower/pd_upper/pd_special that are copied from the "standard" page
> header, that are unused. There's a FIXME comment about that too. Let's
> remove them, there's no need for UNDO pages to look like standard
> relation pages. The LSN needs to be at the beginning, to work with the
> buffer manager, but that's the only requirement.
>
> Could we leave out the UNDO and discard worker processes for now?
> Execute all UNDO actions immediately at rollback, and after crash
> recovery. That would be fine for cleaning up orphaned files,
>

Even if we execute all the undo actions on rollback, we need discard
worker to discard undo at regular intervals.  Also, what if we get an
error while applying undo actions during rollback?  Right now, we have
a mechanism to push such a request to background worker and allow the
session to continue.  Instead, we might want to Panic in such cases if
we don't want to have background undo workers.

> and it
> would cut down the size of the patch to review.
>

If we can find some way to handle all cases and everyone agrees to it,
that would be good. In fact, we can try to get the basic stuff
committed first and then try to get the rest (undo-worker machinery)
done.


> Can this race condition happen: Transaction A creates a table and an
> UNDO record to remember it. The transaction is rolled back, and the file
> is removed. Another transaction, B, creates a different table, and
> chooses the same relfilenode. It loads the table with data, and commits.
> Then the system crashes. After crash recovery, the UNDO record for the
> first transaction is applied, and it removes the file that belongs to
> the second table, created by transaction B.
>

I don't think such a race exists, but we should verify it once.
Basically, once the rollback is complete, we mark the transaction
rollback as complete in the transaction header in undo and write a WAL
for it.  After crash-recovery, we will skip such a transaction.  Isn't
that sufficient to prevent such a race condition?

Thank you for looking into this work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jeevan Ladhe
Дата:
Сообщение: Re: concerns around pg_lsn
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: POC: Cleaning up orphaned files using undo logs