Re: [HACKERS] UNDO and in-place update

From: Amit Kapila
Subject: Re: [HACKERS] UNDO and in-place update
Date:
Msg-id: CAA4eK1LgJE+Z9y7Far3Ngponf4nF0M-Ch0zS2xyyRhyNhUtJJw@mail.gmail.com
In reply to: Re: [HACKERS] UNDO and in-place update  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: [HACKERS] UNDO and in-place update  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Jan 10, 2017 at 7:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 9, 2017 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> Yes, something like this can be done.  You don't really need any new
>> page-level header data, because you can get the XIDs from the TPD
>> entry (or from the page itself if there's only one).  But you could
>> expand the single "is-modified" bit that I've proposed adding to each
>> tuple to multiple bits.  0 means not recently modified.  1 means
>> modified by the first or only transaction that has recently modified
>> the page.  2 means modified by the second transaction that has
>> recently modified the page.  Etc.
>>
>
> makes sense.
>
>> What I was thinking about doing instead is storing an array in the TPD
>> containing the same information.  There would be one byte or one half
>> a byte or whatever per TID and it would contain the index of the XID
>> in the TPD that had most recently modified or locked that TID.  Your
>> solution might be better, though, at least for cases where the number
>> of tuples that have modified the page is small.
>>
>
> I think we also need to prevent multiple backends trying to reserve a
> slot in this array which can be a point of contention.  Another point
> is during pruning, if due to row movement TIDs are changed, we need to
> keep this array in sync.
>
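
To make the encoding concrete, here is a minimal standalone C sketch
of the per-tuple "modified-by" bits indexing into the TPD's XID array;
all the names and sizes below are illustrative only, not from any
patch:

#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId    ((TransactionId) 0)

#define TPD_MAX_XACTS   4           /* hypothetical per-page XID limit */
#define TUPLE_XACT_NONE 0           /* 0 = not recently modified */

typedef struct TPDEntry
{
    uint16_t        num_xacts;              /* XID slots in use */
    TransactionId   xacts[TPD_MAX_XACTS];   /* recently modifying XIDs */
} TPDEntry;

/*
 * Per-tuple variant: the tuple's bits hold k, where k == 0 means "not
 * recently modified" and k >= 1 means "modified by xacts[k - 1]".
 */
static inline TransactionId
tuple_recent_modifier(uint8_t tuple_bits, const TPDEntry *tpd)
{
    if (tuple_bits == TUPLE_XACT_NONE || tuple_bits > tpd->num_xacts)
        return InvalidTransactionId;
    return tpd->xacts[tuple_bits - 1];
}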

Moving further, I have thought through consistent reads and several
possible formats for storing undo, with the pros and cons of each.

Consistent Reads
----------------------------
If a page has been modified by any transaction that is not visible to
the reading transaction's snapshot, we have to reconstruct a copy of
the page.  There could be multiple ways to achieve this:

1. Apply the undo for the transactions that are not visible to the
read transaction, and keep the resulting copy of the page.
2. Apply the undo for all the transactions in the TPD entry of the page.

I think we can't do the second, because it can produce inconsistent
rows: it applies the undo of transactions that *are* visible to the
read transaction, rolling back changes the reader is entitled to see.
The first approach will work, but it might turn out to be inefficient
if not implemented carefully: since different sets of transactions are
invisible to different readers, each read transaction may need its own
separate copy of the page.  Even for read transactions to which the
same set of transactions is invisible, we need some way to recognize
which existing version of the page can be reused, probably by trying
each existing version and checking whether it serves the purpose.  I
think we also need some way to link the different versions of a page
in shared buffers, so that before creating a new version of the page
we can look through the linked versions.
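
As a rough sketch of the first approach (reusing TPDEntry and
TransactionId from the earlier sketch; Page, Snapshot, and the three
helper functions are hypothetical stand-ins, not existing PostgreSQL
functions):

#include <stdbool.h>
#include <stddef.h>

typedef struct PageData *Page;
typedef struct SnapshotData *Snapshot;

extern Page page_copy(Page page);                                /* assumed */
extern bool xid_visible_in_snapshot(TransactionId xid, Snapshot snap);
extern void apply_undo_chain(Page page, TransactionId xid);      /* assumed */

static Page
make_consistent_copy(Page page, const TPDEntry *tpd, Snapshot snap)
{
    Page        copy = page_copy(page);     /* shared buffer stays intact */

    for (int i = 0; i < tpd->num_xacts; i++)
    {
        TransactionId xid = tpd->xacts[i];

        /*
         * Undo only the invisible transactions.  Undoing a visible one
         * (option 2 above) would roll back changes the reader must see.
         */
        if (!xid_visible_in_snapshot(xid, snap))
            apply_undo_chain(copy, xid);
    }
    return copy;
}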

Yet another consideration in this area is to perform consistent reads
differently for index scans.  An index scan gives us a particular TID
to read, so in this case we can try to reconstruct the old image of
just that row.  While this sounds efficient, we might need to
reconstruct the old image of the same row multiple times for different
read transactions, and the effect is magnified when we have to do this
for multiple rows of a page.  I think it might be better to always
construct a complete copy of the page if one or more transactions that
have changed the page are not visible to the read snapshot.
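
For comparison, a sketch of the per-TID alternative, walking one row's
prior-version chain until a version visible to the snapshot is found;
UndoRecPtr, TupleVersion, and undo_fetch_version are again
hypothetical, and the snapshot helpers are those declared above:

typedef uint64_t UndoRecPtr;            /* location of one undo record */

typedef struct TupleVersion
{
    TransactionId xmin;                 /* XID that created this version */
    UndoRecPtr    prior_version;        /* older image in undo, 0 if none */
    /* ... tuple payload would follow ... */
} TupleVersion;

extern TupleVersion *undo_fetch_version(UndoRecPtr ptr);    /* assumed */

static TupleVersion *
fetch_visible_version(TupleVersion *cur, Snapshot snap)
{
    /* This walk may be repeated per reader, which is the cost noted above. */
    while (cur != NULL && !xid_visible_in_snapshot(cur->xmin, snap))
        cur = (cur->prior_version != 0)
            ? undo_fetch_version(cur->prior_version)
            : NULL;
    return cur;                         /* NULL if no visible version */
}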

Undo Page Format
-------------------------------
I think we can organize undo in multiple ways:

(a) One undo file per transaction - We can name each undo file as
epoch+xid or epoch+xid.undo.  These files can be kept in a pg_undo
directory under PGDATA.  In this approach, we have to store a
latest_undo offset in the file header to perform the rollback
operation; this offset can also be used to allocate the location for a
new undo record.  Apart from that, undo entries can be organized one
after another.  Each undo entry will have two back pointers: one to
reach the previous undo location, which will be used for rollback, and
a second to maintain the prior undo chain for changes to a particular
page, which will be used to reconstruct the copy of that page.  During
recovery, we need to read each undo file and check the transaction's
status in clog; if it is not committed, we apply all its undo starting
from the latest_undo position.  I think we should organize undo in the
form of pages with a size similar to heap pages.  This will be helpful
both for writing WAL and for checkpointing undo, where we expect
pages.  Access to undo can be via shared buffers, though I think we
might want to keep the shared buffers containing undo separate from
those for heap/index pages, so that undo pages can't be evicted based
on access to heap pages.  Different TPD entries containing the same
XID will point to different locations in undo; an undo location will
be just a page offset for a particular page entry.  Vacuum can delete
an undo file or make it reusable when the corresponding transaction
precedes RecentGlobalXmin.
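
A possible shape for the file header and undo records under this
approach; only the ideas (the latest_undo offset and the two back
pointers per record) come from the description above, while the field
names and widths are mine:

#define UNDO_PAGE_SIZE  8192            /* same size as heap pages */

typedef struct UndoFileHeader
{
    uint32_t        epoch;              /* file named epoch+xid.undo */
    TransactionId   xid;
    uint64_t        latest_undo;        /* offset of the newest record;
                                         * start point for rollback and
                                         * for placing the next record */
} UndoFileHeader;

typedef struct UndoRecordHeader
{
    uint64_t    prev_undo;              /* previous record of this xact;
                                         * followed backwards at rollback */
    uint64_t    prev_page_undo;         /* prior undo for the same heap
                                         * page; followed to reconstruct
                                         * an old page image */
    uint16_t    rec_len;                /* total record length */
    /* ... operation-specific payload follows ... */
} UndoRecordHeader;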

To decide which files can be removed/reused, vacuum either needs to
traverse the TPD entries or to scan the files in the pg_undo
directory.  Along with the removal of an undo file, we also need some
way to indicate that the TPD entry/entries for that undo are invalid.
A simple way could be to consider a TPD entry invalid if the
corresponding undo file doesn't exist.  However, that won't work if we
choose to store the undo of multiple transactions in one undo file.

The main advantage of this approach is that it allows unneeded undo
space to be removed efficiently.  The disadvantage is that creating a
separate file for each transaction can be costly, and it can lead to a
lot of undo files when there are long-running transactions.  Of
course, we can optimize by allowing such files to be removed and
giving the user a "snapshot_too_old" error, but I think it is still a
matter of concern.  To alleviate this concern, we can keep the undo
data of multiple transactions in one file.

(b) Multiple transactions in one undo file - Here we have to choose
some arbitrary names for the undo files, say undo_1, undo_2, etc. for
the sake of discussion.  These files can be kept in the pg_undo
directory under PGDATA.  In this approach, we can allocate a set of
four or eight 8KB pages (pages of the same size as heap pages) per
transaction; let's call each such set of pages an undo segment.  If
one segment is not sufficient to store the undo for a transaction, we
allocate the next segment.  Each segment has a segment header in its
first page containing the previous segment's address: for the very
first segment the previous segment address is Nil, the second segment
contains the address of the first, the third the address of the
second, and so on.  The previous segment address is needed to replay
undo for rollback.  I have deliberately avoided storing the next
segment address, as we don't need it.

We also need some form of freespace map to keep track of undo segments
that are freed.  Once vacuum decides to free the undo of a particular
transaction, it adds all the undo segments of that transaction to the
freespace map.  To allocate a new segment for an existing or a new
transaction, we first search the freespace map for an available
segment and, if none is available, allocate a new one.

In each file, the first segment contains slots holding the latest_undo
location and the XID (probably epoch + XID) of each transaction that
has undo in the file.  We need the latest_undo location to roll back
the transaction, and the XID to roll it back at the time of recovery.
Let's call this first segment the HEAD segment.  We can extend the
HEAD segment if there is more space in the file, but for simplicity of
discussion let's say one undo file holds only the transactions that
can get a slot in the HEAD segment.  Each slot can be 6 or 8 bytes
long, so with segments of size 32KB we can hold the undo of ~4096
transactions in one file.  We can keep some bookkeeping information to
check whether there is an empty slot in the first segment of a file:
whenever a new transaction has to allocate undo, it checks the HEAD
segment of undo_1; if there is space to allocate a new slot, it
allocates it there, else it tries the next file, undo_2, and so on.

We do need to consider what to do when the file chosen to hold a
transaction's undo becomes full while the transaction still has more
undo to write.  There are a couple of options: we can first try to
free the space of some transaction that has already committed, and if
no free space is available, we can either choose to wait or allocate
an additional XID to write the undo, as proposed in the initial e-mail
in this thread.

Like the previous approach, undo can be accessed via shared buffers,
and each undo entry will have the two back pointers described in the
previous approach.  There are some loose ends, such as how the
freespace map will be organized, which need more thought, but I think
the core of the idea can be made to work.
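
To make the segment and slot bookkeeping concrete, a sketch matching
the numbers above (four 8KB pages per segment, 8-byte slots, so a 32KB
HEAD segment holds ~4096 slots); note that storing a full 64-bit
epoch + XID would widen the slot, and all the names are again
hypothetical:

#define UNDO_SEG_PAGES  4
#define UNDO_SEG_SIZE   (UNDO_SEG_PAGES * UNDO_PAGE_SIZE)   /* 32KB */

typedef struct UndoSegmentHeader
{
    uint64_t    prev_segment;   /* address of this xact's previous segment,
                                 * 0 (Nil) for its first one; needed to
                                 * replay undo for rollback.  No "next"
                                 * pointer is kept, as it isn't needed. */
} UndoSegmentHeader;

/* One HEAD-segment slot; 32768 / 8 = 4096 slots per file. */
typedef struct UndoSlot
{
    TransactionId   xid;            /* identifies the owner; lets recovery
                                     * find and roll back aborted xacts */
    uint32_t        latest_undo;    /* offset of the xact's newest record */
} UndoSlot;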

This will overcome the disadvantages of the previous approach: (a)
there could be too many undo files, and (b) the cost associated with
creating and managing so many files.  I think it might be slightly
less efficient than the previous approach at reclaiming unused undo
space, as the freed space in an undo file can't be returned to the OS
until all the transactions in that file precede RecentGlobalXmin.  One
disadvantage compared to the previous approach is that there can be
some contention in finding a free undo slot in an undo file, but I
think the advantage of having fewer files to manage will outweigh it.

This still doesn't cover the UNDO record contents for each operation.
I think we can design those once the file format is fixed, although
there is no strict dependency between the two.

Thanks to Dilip Kumar for the offlist discussions on this topic, which
helped figure out some of the design points.

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


