Re: POC: Cleaning up orphaned files using undo logs

From: Andres Freund
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date:
Msg-id: 20190814163951.yrztf32j6svq5j3r@alap3.anarazel.de
In response to: Re: POC: Cleaning up orphaned files using undo logs  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2019-08-13 17:05:27 +0530, Dilip Kumar wrote:
> On Mon, Aug 5, 2019 at 11:59 PM Andres Freund <andres@anarazel.de> wrote:
> > (as I was out of context due to dealing with bugs, I've switched to
> > looking at the current zheap/undoprocessing branch.)
> >
> > On 2019-07-30 01:02:20 -0700, Andres Freund wrote:
> > > +/*
> > > + * Insert previously-prepared undo records.
> > > + *
> > > + * This function will write the actual undo record into the buffers which are
> > > + * already pinned and locked in PreparedUndoInsert, and mark them dirty.  This
> > > + * step should be performed inside a critical section.
> > > + */
> >
> > Again, I think it's not ok to just assume you can lock an essentially
> > unbounded number of buffers. This seems almost guaranteed to result in
> > deadlocks. And there's limits on how many lwlocks one can hold etc.
> 
> I think for controlling that we need to put a limit on max prepared
> undo?  I am not sure of any other way of limiting the number of buffers,
> because we must lock all the buffers in which we are going to insert
> the undo record under one WAL-logged operation.

I've heard that a number of times. But I still don't know why that'd
actually be true. Why would it not be sufficient to just lock the buffer
currently being written to, rather than all buffers? It'd require a bit
of care updating the official current "logical end" of a log, but
otherwise ought to not be particularly hard? Only one backend can extend
the log after all, and until the log is externally visibly extended,
nobody can read or write those buffers, no?
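
Very roughly, and with made-up names (prepared_undo,
write_undo_slice_into_page and advance_undo_insert_pointer don't exist
anywhere, this is just a sketch of the scheme I mean):

    int     i;

    for (i = 0; i < prepared_undo->nbuffers; i++)
    {
        Buffer  buf = prepared_undo->buffers[i];

        /* lock, fill and dirty just this one page */
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        write_undo_slice_into_page(BufferGetPage(buf), prepared_undo, i);
        MarkBufferDirty(buf);
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }

    /* only now publish the new logical end of the log to other backends */
    advance_undo_insert_pointer(prepared_undo->log, prepared_undo->end_ptr);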


> >
> > As far as I can tell there's simply no deadlock avoidance scheme in use
> > here *at all*? I must be missing something.
> 
> We are always locking buffers in block order, so I am not sure how it
> can deadlock?  Am I missing something?

Do we really in all circumstances? Note that we update the transinfo
(and also progress) from earlier in the log. But my main point is more
that there's no documented deadlock avoidance scheme. Which imo means
there's none, because nobody will know to maintain it.



> > > +                     /*
> > > +                      * During recovery, there might be some blocks which are already
> > > +                      * deleted due to some discard command so we can just skip
> > > +                      * inserting into those blocks.
> > > +                      */
> > > +                     if (!BufferIsValid(buffer))
> > > +                     {
> > > +                             Assert(InRecovery);
> > > +
> > > +                             /*
> > > +                              * Skip actual writing; just update the context so that we have
> > > +                              * the write offset for inserting into the next blocks.
> > > +                              */
> > > +                             SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte);
> > > +                             if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> > > +                                     break;
> > > +                     }
> >
> > How exactly can this happen?
> 
> Suppose you insert one record for a transaction which is split across
> block1 and block2.  Now, before these blocks actually go to disk, the
> transaction commits and becomes all-visible, and the undo logs are
> discarded.  It's possible that block 1 is completely discarded but
> block 2 is not, because it might hold undo for the next transaction.
> Now, during recovery (FPW is off), block 1 is missing but block 2 is
> there, so we need to skip inserting undo for block 1 as it does not
> exist.

Hm. I'm quite doubtful this is a good idea. How will this not force us
to emit a lot more expensive durable operations while writing undo?
And doesn't this reduce error detection quite remarkably?

Thomas, Robert?





> > > +                     /* Read the undo record. */
> > > +                     UndoGetOneRecord(uur, urecptr, rnode, category, &buffer);
> > > +
> > > +                     /* Release the discard lock after fetching the record. */
> > > +                     if (!InHotStandby)
> > > +                             LWLockRelease(&slot->discard_lock);
> > > +             }
> > > +             else
> > > +                     UndoGetOneRecord(uur, urecptr, rnode, category, &buffer);
> >
> >
> > And then we do none of this in !one_page mode.
> UndoBulkFetchRecord is always called for an aborted transaction, so
> its undo can never get discarded concurrently; ideally, we don't
> need to check for discard.

That's an undocumented assumption. Why would anybody reading the
interface know that?
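
If we do keep relying on it, it at least needs to be spelled out at the
interface, e.g. roughly (wording mine, parameter names approximate):

/*
 * UndoBulkFetchRecord
 *
 * Callers must guarantee that the undo being fetched belongs to an
 * aborted transaction and therefore cannot be discarded concurrently;
 * unlike the one_page path, no discard_lock is taken here.
 */

and, if there's a cheap way to verify the guarantee, an Assert() of it
at the top of the function.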



> > > +static uint16
> > > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer,
> > > +                                      UndoLogCategory category)
> > > +{
> > > +     UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp);
> > > +     BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp);
> > > +     Buffer          buffer = input_buffer;
> > > +     Page            page = NULL;
> > > +     char       *pagedata = NULL;
> > > +     char            prevlen[2];
> > > +     RelFileNode rnode;
> > > +     int                     byte_to_read = sizeof(uint16);
> >
> > Shouldn't it be byte_to_read?

Err, *bytes*_to_read.


> > And the sizeof a type that's tied with the actual undo format?
> > Imagine we'd ever want to change the length format for undo records
> > - this would be hard to find.
> 
> Do you mean that we should not rely on undo format i.e. we should not
> assume that undo length is stored at the end of the undo record?

I was referencing the use of sizeof(uint16). I think this should either
reference an UndoRecLen typedef or something like it, or use something
roughly like
#define member_size(type, member) (sizeof((type){0}.member))
and then have bytes_to_read be set to something like
member_size(PackedUndoRecord, len)
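
E.g., as a standalone toy just to show the macro in use (the struct here
is only a stand-in; the real PackedUndoRecord obviously looks different):

#include <stdint.h>
#include <stdio.h>

typedef struct PackedUndoRecord
{
    uint16_t    len;        /* record length, also stored at the record's end */
} PackedUndoRecord;

#define member_size(type, member) (sizeof((type){0}.member))

int
main(void)
{
    size_t      bytes_to_read = member_size(PackedUndoRecord, len);

    printf("%zu\n", bytes_to_read);     /* prints 2 */
    return 0;
}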

> > > +     char            persistence;
> > > +     uint16          prev_rec_len = 0;
> > > +
> > > +     /* Get relfilenode. */
> > > +     UndoRecPtrAssignRelFileNode(rnode, urp);
> > > +     persistence = RelPersistenceForUndoLogCategory(category);
> > > +
> > > +     if (BufferIsValid(buffer))
> > > +     {
> > > +             page = BufferGetPage(buffer);
> > > +             pagedata = (char *) page;
> > > +     }
> > > +
> > > +     /*
> > > +      * The length of the previous undo record is stored at the end of that
> > > +      * record, so just fetch the last 2 bytes.
> > > +      */
> > > +     while (byte_to_read > 0)
> > > +     {
> >
> > Why does this need a loop around the number of bytes? Can there ever be
> > a case where this is split across a record? If so, isn't that a bad idea
> > anyway?
> Yes, as of now, an undo record can be split at any point; even the undo
> length can be split across 2 pages.  I think we can reduce complexity
> by making sure the undo length doesn't get split across pages.

I think we definitely should do that. I'd probably even include more
than just the size in the header that's not allowed to be split across
pages.


> But to handle that, while allocating the undo we need to detect
> whether the undo length could get split, by checking the space in the
> current page against the undo record length, and based on that we need
> to allocate 1 extra byte in the undo log.  Seems that will add extra
> complexity.

That seems fairly straightforward?
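
Roughly (sketch only; undo_record_size(), urec and insert_ptr are made
up here, and per-page headers are ignored for brevity):

    /*
     * At allocation time, check whether the trailing 2-byte length field
     * would straddle a page boundary; if so, pad the allocation by one
     * byte so the field lands wholly on the next page.
     */
    int             size = undo_record_size(urec);
    UndoLogOffset   end = UndoRecPtrGetPageOffset(insert_ptr) + size;

    if ((end - sizeof(uint16)) / BLCKSZ != (end - 1) / BLCKSZ)
        size += 1;      /* the length field would split; add one pad byte */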

Greetings,

Andres Freund


