Re: POC: Cleaning up orphaned files using undo logs

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: POC: Cleaning up orphaned files using undo logs
Дата
Msg-id CA+TgmoZAh=THs=r10o-Ze37XMmdh9jqPWBcT6R-A9btuQLPavQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: POC: Cleaning up orphaned files using undo logs  (Dilip Kumar <dilipbalaut@gmail.com>)
Список pgsql-hackers
On Thu, Apr 25, 2019 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
> > +static int buffer_idx;
> >
> > This is a file-global variable with a very generic name that is very
> > similar to a local variable name used by multiple functions within the
> > file (bufidx) and no comments.  Ugh.
> >
> Comment added.

The variable name is still bad, and the comment isn't very helpful
either.  First, you can't tell by looking at the name that it has
anything to do with the undo_buffers variable, because undo_buffers
and buffer_idx are not obviously related names.  Second, it's not an
index; it's a count.  A count tells you how many of something you
have; an index tells you which one of those you are presently thinking
about.  Third, the undo_buffer array is itself poorly named, because
it's not an array of all the undo buffers in the world or anything
like that, but rather an array of undo buffers for some particular
purpose.  "static UndoBuffers *undo_buffer" is about as helpful as
"int integer" and I hope I don't need to explain why that isn't
usually a good thing to write.  Maybe prepared_undo_buffer for the
array and nprepared_undo_buffer for the count, or something like that.

> I think the simple solution will be that inside UndoRecordIsValid
> function we can directly check UndoLogIsDiscarded if oldest_data is
> not yet initialized, I think we don't need to release the discard lock
> for that.  So I have changed it like that.

Can we instead eliminate the special case?  It seems like the if
(log->oldest_data == InvalidUndoRecPtr) case will be taken very
rarely, so if it's buggy, we might not notice.

> Right, from uur_prevxid we would know that for which xid's undo we are
> looking for but without having uur_xid in undo record it self how we
> would know which undo record is inserted by the xid we are looking
> for?  Because in zheap while following the undo chain and if slot got
> switch, then there is possibility (because of slot reuse) that we
> might get some other transaction's undo record for the same zheap
> tuple, but we want to traverse back as we want to find the record
> inserted by uur_prevxid.  So we need uur_xid as well to tell who is
> inserter of this undo record?

It seems desirable to me to make this the caller's problem.  When we
are processing undo a transaction at a time, we'll know the XID
because it will be available from the transaction header.  If a system
like zheap maintains a pointer to an undo record someplace in the
middle of a transaction, it can also store the XID if it needs it.
The thing is, the zheap code almost works that way already.
Transaction slots within a page store both the undo record pointer and
the XID.  The only case where zheap doesn't store the undo record
pointer and the XID is when a slot switch occurs, but that could be
changed.

If we moved urec_xid into UndoRecordTransaction, we'd save 4 bytes per
undo record across the board.  When zheap emits undo records for
updates or deletes, they would need store an UndoRecPtr (8 bytes) +
FullTransactionId (8 bytes) in the payload unless the previous change
to that TID is all-visible or the previous change to that TID was made
by the same transaction.  Also, zheap would no longer need to store
the slot number in the payload in any case, because this would
substitute for that (and permit more efficient lookups, to boot).  So
the overall impact on zheap update and delete records would be
somewhere between -4 bytes (when we save the space used by XID and
incur no other cost) and +12 bytes (when we lose the XID but gain the
UndoRecPtr + FullTransactionId).

That worst case could be further optimized.  For example, instead of
storing a FullTransactionId, zheap could store the difference between
the XID to which the current record pertains (which in this model the
caller is required to know) and the XID of whoever last modified the
tuple. That difference certainly can't exceed 4 billion (or even 2
billion) so 4 bytes is enough.  That reduces the worst-case difference
to +8 bytes.  Probably zheap could use payloads with some kind of
variable-length encoding and squeeze out even more in common cases,
but I'm not sure that's necessary or worth the complexity.

Let's also give uur_blkprev its own UREC_INFO_* constant and omit it
when this is the first time we're touching this block in this
transaction and thus the value is InvalidUndoRecPtr.  In the
pretty-common case where a transaction updates one tuple on the page
and never comes back, this - together with the optimization in the
previous paragraph - will cause zheap to come out even on undo,
because it'll save 4 bytes by omitting urec_xid and 8 bytes by
omitting uur_blkrev, and it'll lose 8 bytes storing an UndoRecPtr in
the payload and 4 bytes storing an XID-difference.

Even with those changes, zheap's update and delete could still come
out a little behind on undo volume if hitting many tuples on the same
page, because for every tuple they hit after the first, we'll still
need the UndoRecPtr for the previous change to that page (uur_blkprev)
and we'll also have the UndoRecPtr extracted from the tuple's previous
slot, store in the payload.  So we'll end up +8 bytes in this case.  I
think that's acceptable, because it often won't happen, it's hardly
catastrophic if it does, and saving 4 bytes on every insert, and on
every update or delete where the old undo is already discarded is
pretty sweet.

> Yeah so we can directly jump to the header which is not yet completely
> read but if any of the header is partially read then we need to
> maintain some kind of partial read variable otherwise from 'already
> read' we wouldn't be able to know how many bytes of the header got
> read in last call unless we calculate that from uur_info or maintain
> the partial_read in context like I have done in the new version.

Right, we need to know the bytes already read for the next header.

> Note:
> - I think the ucontext->stage are same for the insert and DECODE can
> we just declare only
>   one enum and give some generic name e.g. UNDO_PROCESS_STAGE_HEADER ?

I agree.  Maybe UNDO_PACK_STAGE_WHATEVER or, more briefly, UNDO_PACK_WHATEVER.

> - In SkipInsertingUndoData also I have to go through all the stages so
> that if we find some valid
> block then stage is right for inserting the partial record?  Do you
> think I could have avoided that?

Hmm, I didn't foresee that, but no, I don't think you can avoid that.
That problem wouldn't occur before we added the stage stuff, since
we'd just go through all the stages every time and each one would know
its own size and do nothing if that number of bytes had already been
passed, but with this design there seems to be no way around it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Melanie Plageman
Дата:
Сообщение: Re: Calling PrepareTempTablespaces in BufFileCreateTemp
Следующее
От: Ashwin Agrawal
Дата:
Сообщение: Re: Calling PrepareTempTablespaces in BufFileCreateTemp