Re: Deriving Recovery Snapshots

From Heikki Linnakangas
Subject Re: Deriving Recovery Snapshots
Msg-id 48FEF28A.5060803@enterprisedb.com
In response to Re: Deriving Recovery Snapshots  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Deriving Recovery Snapshots  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Deriving Recovery Snapshots  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Simon Riggs wrote:
> On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> * The backend slot may not be reused for some time, so we should take
>>> additional actions to keep state current and true. So we choose to log a
>>> snapshot from the master into WAL after each checkpoint. This can then
>>> be used to cleanup any unobserved xids. It also provides us with our
>>> initial state data, see later.
>> We don't need to log a complete snapshot, do we? Just oldestxmin should 
>> be enough.
> 
> Possibly, but you're thinking that once we're up and running we can use
> less info.
> 
> Trouble is, you don't know when/if the standby will crash/be shutdown.
> So we need regular full snapshots to allow it to re-establish full
> information at regular points. So we may as well drop the whole snapshot
> to WAL every checkpoint. To do otherwise would mean more code and less
> flexibility.

Surely it's less code to write the OldestXmin to the checkpoint record, 
rather than a full snapshot, no? And to read it off the checkpoint record.

>>> UnobservedXids is maintained as a sorted array. This comes for free
>>> since xids are always added in xid assignment order. This allows xids to
>>> be removed via bsearch when WAL records arrive for the missing xids. It
>>> also allows us to stop searching for xids once we reach
>>> latestCompletedXid.
>> If we're going to have an UnobservedXids array, why don't we just treat 
>> all in-progress transactions as Unobserved, and forget about the dummy 
>> PROC entries?
> 
> That's a good question and I expected some debate on that.
> 
> The main problem is fatal errors that don't write abort records. By
> reusing the PROC entries we can keep those to a manageable limit. If we
> don't have that, the number of fatal errors could cause that list to
> grow uncontrollably and we might overflow any setting, causing snapshots
> to stall and new queries to hang. We really must have a way to place an
> upper bound on the number of unobserved xacts. So we really need the
> proc approach. But we also need the UnobservedXids array.

If you write the oldestxmin (or a full snapshot, including the 
oldestxmin) to each checkpoint record, you can crop out any unobserved 
xids older than that, when you replay the checkpoint record.
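
To make the cropping step concrete, here is a minimal sketch in plain C. 
It is not PostgreSQL code: the struct, the MAX_UNOBSERVED capacity and 
the crop_unobserved() name are made up for illustration, and it compares 
xids with plain "<", ignoring xid wraparound, which real code would have 
to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

#define MAX_UNOBSERVED 64	/* hypothetical capacity, for illustration */

/* Hypothetical stand-in for the shared UnobservedXids array. */
typedef struct UnobservedXids
{
	size_t		count;
	TransactionId xids[MAX_UNOBSERVED];	/* sorted ascending */
} UnobservedXids;

/*
 * Called when a checkpoint record carrying oldest_xmin is replayed.
 * Drops every xid older than oldest_xmin and returns how many were
 * dropped.  Assumption: no xid wraparound, so "<" is a valid ordering.
 */
static size_t
crop_unobserved(UnobservedXids *u, TransactionId oldest_xmin)
{
	size_t		first_kept = 0;

	assert(u->count <= MAX_UNOBSERVED);

	/* The array is sorted, so the stale xids form a prefix. */
	while (first_kept < u->count && u->xids[first_kept] < oldest_xmin)
		first_kept++;

	memmove(u->xids, u->xids + first_kept,
			(u->count - first_kept) * sizeof(TransactionId));
	u->count -= first_kept;
	return first_kept;
}
```

With this, xids left behind by backends that died without writing an 
abort record disappear at the next checkpoint replay at the latest, 
which is what bounds the array.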

> Having only an UnobservedXid array was my first thought and I said
> earlier I would do it without using procs. Bad idea. Using the
> UnobservedXids array means every xact removal requires a bsearch,
> whereas with procs we can do a direct lookup, removing all xids in one
> stroke. Much better for typical cases.

How much does that really matter? Under normal circumstances, the array 
would be quite small anyway. A bsearch of a relatively small array isn't 
that expensive. Or we could use a hash table, so that removing/inserting 
items doesn't need to shift all the following entries.
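
For reference, removal from a sorted array with the standard C bsearch() 
looks roughly like the sketch below. The remove_xid() helper and the 
comparator are hypothetical names, and wraparound is again ignored. The 
lookup is O(log n); the memmove afterwards is the entry shifting that a 
hash table would avoid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Comparator for bsearch(); assumes no xid wraparound. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	return (x > y) - (x < y);
}

/*
 * Remove xid from the sorted array xids[0..*count-1].
 * Returns true if the xid was present and has been removed.
 */
static bool
remove_xid(TransactionId *xids, size_t *count, TransactionId xid)
{
	TransactionId *hit;
	size_t		idx;

	assert(xids != NULL && count != NULL);

	hit = bsearch(&xid, xids, *count, sizeof(TransactionId), xid_cmp);
	if (hit == NULL)
		return false;

	/* Close the gap by shifting the following entries down by one. */
	idx = (size_t) (hit - xids);
	memmove(hit, hit + 1, (*count - idx - 1) * sizeof(TransactionId));
	(*count)--;
	return true;
}
```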

>> Also, I can't help thinking that this would be a lot simpler if we just 
>> treated all subtransactions the same as top-level transactions. The only 
>> problem with that is that there can be a lot of subtransactions, which 
>> means that we'd need a large UnobservedXids array to handle the worst 
>> case, but maybe it would still be acceptable?
> 
> Yes, you see the problem. Without subtransactions, this would be a
> simple issue to solve.
> 
> In one sense, I do as you say. When we make a snapshot we stuff the
> UnobservedXids into the snapshot *somewhere*. We don't know whether they
> are top level or subxacts. But we need a solution for when we run out of
> top-level xid places in the snapshot. Which has now been provided,
> luckily.
> 
> If we have no upper bound on snapshot size then *all* backends would
> need a variable size snapshot. We must solve that problem or accept
> having people wait maybe minutes for a snapshot in worst case. I've
> found one way of placing a bound on the number of xids we need to keep
> in the snapshot. If there is another, better way of keeping it bounded I
> will happily adopt it. I spent about 2 weeks sweating this issue...

How about:

1. Keep all transactions and subtransactions in UnobservedXids.
2. If it fills up, remove from it every xid that the startup process 
knows to be a subtransaction and whose parent it knows, and update 
subtrans accordingly. Mark the array as overflowed.

To take a snapshot, a backend simply copies the UnobservedXids array and 
the flag. If it hasn't overflowed, a transaction is considered to be in 
progress if it's in the array. If it has overflowed, and the xid is not 
in the array, check subtrans.
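
A rough sketch of that snapshot check in plain C follows. Everything in 
it is hypothetical: the struct layout, the function names, and above all 
toy_subtrans, a tiny in-memory table standing in for subtrans. It also 
assumes that "check subtrans" means walking up to the top-level parent 
and testing whether that parent is in the array, and it leaves out the 
xmin/xmax bounds checks and the locking a real snapshot would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

#define InvalidXid		((TransactionId) 0)
#define MAX_UNOBSERVED	8		/* hypothetical capacity */
#define TOY_XID_LIMIT	64		/* toy subtrans covers xids below this */

/* Hypothetical shared state: the xid array plus the overflow flag. */
typedef struct SharedXids
{
	size_t		count;
	bool		overflowed;
	TransactionId xids[MAX_UNOBSERVED];
} SharedXids;

typedef SharedXids Snapshot;	/* a snapshot is just a private copy */

/* Toy stand-in for subtrans: parent of xid, or InvalidXid if none. */
static TransactionId toy_subtrans[TOY_XID_LIMIT];

/* Taking a snapshot is a plain copy of the array and the flag. */
static void
take_snapshot(const SharedXids *shared, Snapshot *snap)
{
	assert(shared->count <= MAX_UNOBSERVED);
	memcpy(snap, shared, sizeof(*snap));
}

static bool
in_array(const Snapshot *snap, TransactionId xid)
{
	for (size_t i = 0; i < snap->count; i++)
		if (snap->xids[i] == xid)
			return true;
	return false;
}

static bool
xid_in_progress(const Snapshot *snap, TransactionId xid)
{
	if (in_array(snap, xid))
		return true;
	if (!snap->overflowed)
		return false;		/* the array is complete: not running */

	/* Overflowed: subxids may have been evicted, so ask subtrans. */
	while (xid < TOY_XID_LIMIT && toy_subtrans[xid] != InvalidXid)
		xid = toy_subtrans[xid];
	return in_array(snap, xid);
}
```

The subtrans lookup is only paid on the slow path: while the array has 
not overflowed, a miss in the array is a definite answer.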

Note that the startup process sees all WAL records, so it can do 
arbitrarily complex bookkeeping in backend-private memory, and only 
expose the necessary parts in shared mem. For example, it can keep track 
of the parent-child relationships of the xids in UnobservedXids, but the 
backends taking snapshots don't need to know about that. For step 2 to 
work, that's exactly what the startup process needs to keep track of.

For the startup process to know about the parent-child relationships, 
we'll need something like the WAL changes you suggested. I'm not too 
thrilled about adding a new field to all WAL records. It seems simpler 
to just rely on the new WAL records written in AssignTransactionId(), 
and we need to write one only, say, every 100 subtransactions, if we 
make the UnobservedXids array big enough (100*max_connections).

This isn't actually that different from your proposal. The big 
difference is that instead of PROC entries and UnobservedXids, all 
transactions are tracked in UnobservedXids, and instead of caching 
subtransactions in the subxids array in PROC entries, they're cached in 
UnobservedXids as well.


Another, completely different approach would be to forget about xid 
arrays altogether, and change the way snapshots are taken: just do a 
full memcpy of the clog between xmin and xmax. That might be pretty slow 
if xmax-xmin is big, though.
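
To give a feel for the cost, here is a toy sketch of such a copy. It 
assumes the clog layout of two status bits per transaction, four 
transactions per byte, and treats the clog as one flat byte array; the 
real clog is paged, so a single memcpy would not work as literally as 
this. All function names are invented for illustration, and wraparound 
is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

#define BITS_PER_XACT	2
#define XACTS_PER_BYTE	4
#define XACT_MASK		0x03

#define XACT_IN_PROGRESS	0x00
#define XACT_COMMITTED		0x01
#define XACT_ABORTED		0x02

/* Demo helper: record a status in the toy flat clog. */
static void
set_status(uint8_t *clog, TransactionId xid, unsigned status)
{
	unsigned	shift = (xid % XACTS_PER_BYTE) * BITS_PER_XACT;
	uint8_t    *byte = &clog[xid / XACTS_PER_BYTE];

	*byte = (uint8_t) ((*byte & ~(XACT_MASK << shift)) | (status << shift));
}

/*
 * Copy the clog bytes covering xids in [xmin, xmax) into out.
 * Returns the number of bytes copied.
 */
static size_t
copy_clog_range(const uint8_t *clog, TransactionId xmin,
				TransactionId xmax, uint8_t *out, size_t out_size)
{
	size_t		first_byte;
	size_t		last_byte;
	size_t		nbytes;

	assert(xmin < xmax);
	first_byte = xmin / XACTS_PER_BYTE;
	last_byte = (xmax - 1) / XACTS_PER_BYTE;
	nbytes = last_byte - first_byte + 1;
	assert(nbytes <= out_size);

	memcpy(out, clog + first_byte, nbytes);
	return nbytes;
}

/* Look up an xid's status in a copy made by copy_clog_range(). */
static unsigned
copied_status(const uint8_t *copy, TransactionId xmin, TransactionId xid)
{
	size_t		byte = xid / XACTS_PER_BYTE - xmin / XACTS_PER_BYTE;
	unsigned	shift = (xid % XACTS_PER_BYTE) * BITS_PER_XACT;

	return (copy[byte] >> shift) & XACT_MASK;
}
```

At four transactions per byte, a window of a million xids between xmin 
and xmax means copying about 256 kB per snapshot, which is where the 
slowness would come from.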

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

