Re: POC: Cleaning up orphaned files using undo logs

From: Heikki Linnakangas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date:
Msg-id: 9d06265c-356a-9ee7-dc22-c21a2353f986@iki.fi
In reply to: Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
           Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
I had a look at the UNDO patches at 
https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, 
and at the patch to use the UNDO logs to clean up orphaned files, from 
undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to 
review?

Thanks Thomas, Amit, and others for working on this! Orphaned relfiles 
have been an ugly wart forever. It's a small thing, but it will be 
really nice to finally fix it. This has been a long thread, and I 
haven't read it all, so please forgive me if I repeat things that have 
already been discussed.

There are similar issues in CREATE/DROP DATABASE code. If you crash in 
the middle of CREATE DATABASE, you can be left with orphaned files in 
the data directory, or if you crash in the middle of DROP DATABASE, the 
data might be gone already but the pg_database entry is still there. We 
should plug those holes too.

There's a lot of stuff in the patches that is not relevant to cleaning 
up orphaned files. I know this orphaned-file cleanup work is mainly a 
vehicle to get the UNDO log committed, so that's expected. If we only 
cared about orphaned files, I'm sure the patches wouldn't spend so much 
effort on concurrency, for example. Nevertheless, I think we should 
leave out the stuff that's clearly unused, for now. For example, a 
bunch of fields in the record format: uur_block, uur_offset, uur_tuple. 
They can be added later, as part of the patches that actually need 
them, but for now they just make the patch larger to review.
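
To make this concrete, here's roughly what I'd expect a minimal record 
header for this use case to look like (a sketch; the struct and field 
names here are mine, not the patch's):

/* Hypothetical minimal UNDO record header, for illustration only. */
typedef struct UndoRecordHeader
{
    RmgrId      urec_rmid;      /* which rmgr's undo handler applies this */
    uint8       urec_type;      /* record type, interpreted by the rmgr */
    FullTransactionId urec_fxid; /* transaction that wrote the record */
    uint16      urec_len;       /* total length, including the payload */
    /* rmgr-defined payload follows */
} UndoRecordHeader;

Everything beyond that, the rmgr can put in its payload.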

Some more thoughts on the record format:

I feel that the level of abstraction is not quite right. There are a 
bunch of fields, like uur_block, uur_offset, uur_tuple, that are 
probably useful for some UNDO resource managers (zheap, I presume), but 
seem kind of arbitrary. How is uur_tuple different from uur_payload? 
Should they be named more generically, as uur_payload1 and uur_payload2? 
And why two; why not three or four different payloads? The WAL record 
format has the concept of a "block id", which lets you store any number 
of different payloads in a record; I think that would be a better 
approach here. Or have only one payload, and let the resource manager 
code divide it as it sees fit.
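
To illustrate, loosely modeled on XLogRegisterData() in xloginsert.c 
(all of these Undo* functions and the UNDO_SMGR_FILE_CREATE record type 
are hypothetical):

/* The AM builds the record from numbered payload chunks... */
UndoBeginInsert();
UndoRegisterData(0, (char *) &rnode, sizeof(RelFileNode));
UndoRegisterData(1, extra_data, extra_len);
undoptr = UndoInsert(RM_SMGR_ID, UNDO_SMGR_FILE_CREATE);

/* ...and the undo handler later looks up each chunk by the same id. */
rnode = *((RelFileNode *) UndoRecGetData(record, 0));

With something like that, uur_tuple, uur_payload and friends all 
collapse into rmgr-defined chunks.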

Many of the fields support a primitive form of compression, where a 
field can be omitted if it has the same value as in the first record on 
the UNDO page. That's handy. But again, I don't like that these fields 
have been hard-coded into the UNDO record format. I can see the 
relation OID, for example, being useful for many AMs, but not all. And 
other AMs might well want to store and deduplicate other things besides 
the fields that are in the patch now. I'd like to move most of the 
fields into AM-specific code, and somehow generalize the compression. 
One approach would be to let the AM store an arbitrary struct and run 
it through a general-purpose compression algorithm, using the UNDO 
page's first record as the "dictionary". Or make the UNDO page's first 
record available in whole to the AM-specific code, and let the AM do 
the deduplication. For cleaning up orphaned files, though, we don't 
really care about any of that, so I'd recommend just ripping it out for 
now. Compression/deduplication can be added back later as a separate 
patch.
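
For the record, the AM-side variant I have in mind would look roughly 
like this (a sketch; UndoPageGetFirstRecord() and UREC_HAS_RELOID are 
made-up names):

/* Sketch of AM-side deduplication against the page's first record. */
UndoRecord *first = UndoPageGetFirstRecord(page);

if (first != NULL && first->reloid == rec->reloid)
    rec->flags &= ~UREC_HAS_RELOID; /* omitted, reader takes it from 'first' */
else
    rec->flags |= UREC_HAS_RELOID;  /* stored explicitly in this record */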

The orphaned-file cleanup patch doesn't actually use the uur_reloid 
field. It stores the RelFileNode instead, in the paylod. I think that's 
further evidence that the hard-coded fields in the record format are not 
quite right.


I don't like the way UndoFetchRecord returns a palloc'd 
UnpackedUndoRecord. I would prefer something similar to the xlogreader 
API, where a new call to UndoFetchRecord invalidates the previous 
result. Partly on efficiency grounds, to avoid the palloc, but also for 
consistency with xlogreader.
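
I.e. something along these lines, mirroring XLogReaderAllocate() / 
XLogReadRecord() (all of the Undo* names here are hypothetical):

UndoReaderState *reader = UndoReaderAllocate();
UnpackedUndoRecord *record;

/* Each call recycles the reader's buffer, invalidating the old record. */
record = UndoReadRecord(reader, undo_ptr);
while (record != NULL)
{
    /* ... inspect or apply 'record' ... */
    record = UndoReadRecord(reader, record->uur_prev);
}
UndoReaderFree(reader);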

In the UNDO page header, there are a bunch of fields, like 
pd_lower/pd_upper/pd_special, that are copied from the "standard" page 
header but are unused. There's a FIXME comment about that, too. Let's 
remove them; there's no need for UNDO pages to look like standard 
relation pages. The LSN needs to be at the beginning to work with the 
buffer manager, but that's the only requirement.
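
The header could then shrink to something like this (a sketch; 
PageXLogRecPtr is the existing type behind pd_lsn, the second field is 
made up):

typedef struct UndoPageHeaderData
{
    PageXLogRecPtr pd_lsn;       /* must be first, for the buffer manager */
    uint16      pd_first_record; /* offset of the first record that begins
                                  * on this page, or 0 if the page starts
                                  * mid-record */
} UndoPageHeaderData;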

Could we leave out the UNDO and discard worker processes for now? 
Execute all UNDO actions immediately at rollback, and after crash 
recovery. That would be fine for cleaning up orphaned files, and it 
would cut down the size of the patch to review.
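
Schematically, the abort path would then just do something like this 
(all names made up, just to show the synchronous shape):

/* Sketch: apply this transaction's undo synchronously, no workers. */
void
AtAbort_ApplyUndo(void)
{
    UndoRecPtr  ptr = GetCurrentTransactionUndoPtr();   /* hypothetical */

    /* Walk the transaction's undo chain backwards, applying each record. */
    while (UndoRecPtrIsValid(ptr))
        ptr = ApplyUndoRecord(ptr);                     /* hypothetical */
}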

Can this race condition happen: Transaction A creates a table and an 
UNDO record to remember it. The transaction is rolled back, and the file 
is removed. Another transaction, B, creates a different table, and 
chooses the same relfilenode. It loads the table with data, and commits. 
Then the system crashes. After crash recovery, the UNDO record for the 
first transaction is applied, and it removes the file that belongs to 
the second table, created by transaction B.
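
To spell out the sequence I have in mind:

1. A: CREATE TABLE; a file with relfilenode X is created, and an UNDO 
   record "unlink X" is written.
2. A: ROLLBACK; the UNDO action runs and unlinks X.
3. B: CREATE TABLE; GetNewRelFileNode() sees X as free and reuses it.
4. B: loads data into the table and COMMITs.
5. Crash.
6. Crash recovery applies A's UNDO record again, unlinking B's 
   committed data.

Unless the completion of A's UNDO actions is durably recorded before 
the relfilenode can be reused, step 6 seems possible.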

- Heikki


