Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From: Peter Geoghegan
Subject: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date:
Msg-id: CAM3SWZR5PQTDyJBMSod0_wzuPdRdj1OAcMpSi6xtiScfYeE64w@mail.gmail.com
In reply to: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, I think we can rule out value locks that are held for the
>> duration of a transaction right away. That's just not going to fly.
>
> I think I agree with that.  I don't think I remember hearing that proposed.

I think I might have been unclear - I mean locks that are held for the
duration of *another* transaction, not our own, as we wait for that
other transaction to commit/abort. I think that earlier remarks from
yourself and Andres implied that this would be necessary. Perhaps I'm
mistaken. Your most recent design proposal doesn't do this, but I
think that that's only because it restricts the user to a single
unique index - it would otherwise be necessary to sit on the earlier
value locks (index tuples belonging to an unfinished transaction)
pending the completion of some other conflicting transaction, which
has numerous disadvantages (as described in my "it suits my purposes
to have the value locks be held for only an instant" mail to you [1]).

>> If we're really lucky, maybe the value locking stuff can be
>> generalized or re-used as part of a btree index insertion buffer
>> feature.
>
> Well, that would be nifty.

Yes, it would. I think, based on a conversation with Rob Wultsch, that
it's another area where MySQL still does quite a bit better.

> I see.  Well, we could try to mimic their semantics, I suppose.  Those
> semantics seem like a POLA violation to me; who would have thought
> that a REPLACE could delete multiple tuples?  But what do I know?

I think that it's fairly widely acknowledged to not be very good.
Every MySQL user uses INSERT...ON DUPLICATE KEY UPDATE instead.
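
For concreteness, here's a toy model of why a single REPLACE can delete
multiple rows (my own sketch with invented names, not MySQL code): a new
row can conflict with *distinct* existing rows on distinct unique keys,
and REPLACE's semantics are to delete every conflicting row and then
insert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy table with two unique columns, a and b. */
typedef struct { int a; int b; bool live; } Row;

/*
 * Hypothetical sketch of REPLACE semantics: delete *every* live row that
 * conflicts with the new row on any unique column, then insert the new row.
 * Returns the number of rows deleted.
 */
static int
replace_row(Row *table, size_t cap, size_t *len, int a, int b)
{
    int     deleted = 0;

    for (size_t i = 0; i < *len; i++)
    {
        if (table[i].live && (table[i].a == a || table[i].b == b))
        {
            table[i].live = false;  /* conflicting row is simply removed */
            deleted++;
        }
    }
    assert(*len < cap);
    table[(*len)++] = (Row) {a, b, true};
    return deleted;
}
```

With existing rows (a=1, b=10) and (a=2, b=20), a REPLACE of (a=1, b=20)
conflicts with both, so two rows quietly disappear - the POLA violation
in question.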

>> The only way that's going to work is if you say "use this unique
>> index", which will look pretty gross in DML.

> Yeah, it's kind of awful.

It is.

>> What definition of equality or inequality?
>
> Binary equality, same as we'd use to decide whether an update can be done HOT.

I guess that's acceptable in theory, because binary equality is
necessarily a *stricter* condition than equality according to some
operator that is an equivalence relation. But the fact remains that
you're just ameliorating the problem by making it happen less often
(both through this kind of trick and by restricting us to one
unique index), not actually fixing it.
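
To illustrate the "stricter condition" point (a toy sketch of my own,
with invented helper names - the case-insensitive compare stands in for
an opclass equality operator like citext's =, and strcmp stands in for
datum byte-wise comparison):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <ctype.h>

/*
 * Stand-in for an operator-class "equals" that is an equivalence relation
 * but looser than byte-wise equality (think citext).
 */
static bool
operator_equal(const char *x, const char *y)
{
    for (; *x && *y; x++, y++)
        if (tolower((unsigned char) *x) != tolower((unsigned char) *y))
            return false;
    return *x == *y;            /* true only if both strings ended */
}

/* Stand-in for binary (byte-wise) equality, as in the HOT-update check. */
static bool
binary_equal(const char *x, const char *y)
{
    return strcmp(x, y) == 0;
}
```

Binary equality implies operator equality, never the reverse: "UPSERT"
and "upsert" are operator-equal but not binary-equal, so a binary check
reports a change where the unique index sees none - hence it only makes
the problem rarer, not gone.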

> Well, there are two separate issues here: what to do about MVCC, and
> how to do the locking.

Totally agreed. Fortunately, unlike the different aspects of value and
row locking, I think that these two questions can reasonably be
considered independently.

> From an MVCC perspective, I can think of only
> two behaviors when the conflicting tuple is committed but invisible:
> roll back, or update it despite it being invisible.  If you're saying
> you don't like either of those choices, I couldn't agree more, but I
> don't have a third idea.  If you do, I'm all ears.

I don't have another idea either. In fact, I'd go so far as to say
that doing any third thing that any reasonable person would consider
better than those two is obviously impossible. But I'd add that we
simply cannot roll back at read committed, so we're just going to have
to hold our collective noses and do strange things with visibility.

FWIW, I'm tentatively looking at doing something like this:

*************** HeapTupleSatisfiesMVCC(HeapTuple htup, S
*** 958,963 ****
--- 959,975 ----
  		 * By here, the inserting transaction has committed - have to check
  		 * when...
  		 */
+
+ 		/*
+ 		 * Not necessarily visible to snapshot under conventional MVCC rules, but
+ 		 * still locked by our xact and not updated -- importantly, normal MVCC
+ 		 * semantics apply when we update the row, so only one version will be
+ 		 * visible at once
+ 		 */
+ 		if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+ 			TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+ 			return true;
+
  		if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
  			return false;		/* treat as still in progress */

This is something that I haven't given remotely enough thought yet, so
please take it with a big grain of salt.
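
If it helps, a self-contained toy restatement of that logic - the types,
fields and checks below are deliberately simplified stand-ins for the
real heap tuple header macros, not actual backend code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Simplified stand-in for the relevant heap tuple header state. */
typedef struct
{
    TransactionId xmin;                 /* inserting xact */
    TransactionId xmax;                 /* locker/updater, 0 if none */
    bool          xmax_is_locked_only;  /* cf. HEAP_XMAX_IS_LOCKED_ONLY */
} ToyTuple;

/* Toy snapshot: any xid >= the horizon is treated as still in progress. */
typedef struct { TransactionId in_progress_horizon; } ToySnapshot;

static bool
xid_in_snapshot(TransactionId xid, const ToySnapshot *snap)
{
    return xid >= snap->in_progress_horizon;
}

/*
 * Sketch of the proposed extra step: a committed-but-invisible tuple is
 * nevertheless treated as visible when our own xact merely locked it
 * (and did not update it), so that we can go on to update it.
 */
static bool
toy_satisfies_mvcc(const ToyTuple *tup, const ToySnapshot *snap,
                   TransactionId current_xid)
{
    if (tup->xmax_is_locked_only && tup->xmax == current_xid)
        return true;            /* locked by us, not updated: see it */
    if (xid_in_snapshot(tup->xmin, snap))
        return false;           /* treat inserter as still in progress */
    return true;
}
```

The ordering is the point: the locked-by-us test fires before the
conventional snapshot test, which is exactly the "strange thing with
visibility" being held up for inspection.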

> In terms of how to do the locking, what I'm mostly saying is that we
> could try to implement this in a way that invents as few new concepts
> as possible.  No promise tuples, no new SLRU, no new page-level bits,
> just index tuples and heap tuples and so on.  Ideally, we don't even
> change the WAL format, although step 2 might require a new record
> type.  To the extent that what I actually described was at variance
> with that goal, consider it a defect in my explanation rather than an
> intent to vary.  I think there's value in considering such an
> implementation because each new thing that we have to introduce in
> order to get this feature is a possible reason for it to be rejected -
> for modularity reasons, or because it hurts performance elsewhere, or
> because it's more code we have to maintain, or whatever.

There is certainly value in considering that, and you're right to take
that tack - it is generally valuable to have a patch be minimally
invasive. However, ultimately that's just one aspect of any given
design, an aspect that needs to be weighed against others where there
is a tension. Obviously in this instance I believe, rightly or
wrongly, that doing more - adding more infrastructure than might be
considered strictly necessary - is the least worst thing. Also,
sometimes the apparent similarity of a design to what we have today is
illusory - certainly, I think you'd at least agree that the problems
that bloating during the interim between value locking and row locking
present are qualitatively different to other problems that bloat
presents in all existing scenarios.

FWIW I'm not doing things this way because I'm ambitious, and am
willing to risk not having my work accepted if that means I might get
something that performs better, or has more features (like not
requiring the user to specify a unique index in DML). Rather, I'm
doing things this way because I sincerely believe that on balance mine
is the best, most forward-thinking design proposed to date, and
therefore the design most likely to ultimately be accepted (even
though I do of course accept that there are numerous aspects that need
to be worked out still). If the whole design is ultimately not
accepted, that's something that I'll have to deal with, but understand
that I don't see any way to play it safe here (except, I suppose, to
give up now).

> Now, what I hear you saying is, gee, the performance of that might be
> terrible.  I'm not sure that I believe that, but it's possible that
> you're right.

I think that the average case will be okay, but not great. I think
that the worst case performance may well be unforgivably bad, and it's
a fairly plausible worst case. Even if someone disputes its likelihood
and demonstrates that it isn't actually that likely, that isn't
necessarily very reassuring - getting all the details right is
pretty subtle, especially compared to just not bloating, and just
deferring to the btree code whose responsibilities include enforcing
uniqueness.

> Much seems to depend on what you think the frequency of
> conflicts will be, and perhaps I'm assuming it will be low while
> you're assuming a higher value.  Regardless, if the performance of the
> sort of implementation I'm talking about would be terrible (under some
> agreed-upon definition of what terrible means in this context), then
> that's a good argument for not doing it that way.  I'm just not
> convinced that's the case.

All fair points. Forgive me for repeating myself, but the word
"conflict" needs to be used carefully here, because there are two
basic ways of interpreting it - something that happens due to
concurrent xact activity around the same values, and something that
happens due to there already being some row there with a conflicting
value from some time ago (or that our xact inserted, even). Indeed,
the former *is* generally much less likely than the latter, so the
distinction is important. You could also further differentiate between
value level and row level conflicts, or at least I think that you
should, and that we should allow for value level conflicts.

Let me try and explain myself better, with reference to a concrete
example. Suppose we have a table with a primary key column, A, and a
unique constraint column, B, and we lock the pk value first and the
unique constraint value second. I'm assuming your design, but allowing
for multiple unique indexes because I don't think doing anything less
will be accepted - promise tuples have some of the same problems, as
well as some other idiosyncratic ones (see my earlier remarks on
recovery/freezing [2] for examples of those).

So there is a fairly high probability that the pk value on A will be
unique, and a fairly low probability that the unique constraint value
on B will be unique, at least in this usage pattern of interest, where
the user is mostly going to end up updating. Mostly, we insert a
speculative regular index tuple (that points to a speculative heap
tuple that we might decide to kill) into the pk column, A, right away,
and then maybe block pending the resolution of a conflicting
transaction on the unique constraint column B. I don't think we have
any reasonable way of not blocking on A - if we go and clean up the
already-inserted index tuple before waiting, that's going to bloat
quite dramatically, *and* we have to WAL log. In any case you seemed
to accept that cleaning up bloat
synchronously like that was just going to be too expensive. So I
suppose that rules that out. That just leaves sitting on the "value
lock" (that the pk index tuple already inserted effectively is)
indefinitely, pending the outcome of the first transaction.

What are the consequences of sitting on that value lock indefinitely?
Well, xacts are going to block on the pk value much more frequently,
by simple virtue of the fact that the value locks there are held for a
long time - they just needed to hear a "no" answer, which the unique
constraint was in most cases happy to immediately give, so this is
totally unnecessary. Contention is now in a certain sense almost as
bad for every unique index as it is for the weakest link. That's only
where the problems begin, though, and it isn't necessary for there to
be bad contention on more than one unique index (the pk could just be
on a serial column, say) to see bad effects.

So your long-running xact that's blocking all the other sessions on
its proposed value for A (or maybe even B) - that finally gets to
proceed. Regardless of whether it commits or aborts, there will be a
big bloat race. This is because when the other sessions get the
go-ahead to proceed, they'll all run to get the row lock (one guy
might insert instead). Only one will be successful, but they'll all
kill their heap tuple on the assumption that they'll probably lock the
row, which is only true in the average case. Now, maybe you can teach
them to not bother killing the heap tuple when there are no index
tuples actually inserted to ruin things, but then maybe not, and maybe
it wouldn't help in this instance if you did teach them (because
there's a third, otherwise irrelevant constraint or whatever).

Realize you can generally only kill the heap tuple *before* you have
the row lock, because otherwise a totally innocent non-HOT update (may
not update any unique indexed columns at all) will deadlock with your
session, which I don't think is defensible, and will probably happen
often if allowed to (after all, this is upsert - users are going to
want to update their locked rows!).

So in this scenario, each of the original blockers will simultaneously
try again and again to get the row lock as one transaction proceeds
with locking and then probably commits. For every blocker's iteration
(there will be n_blockers - 1 iterations, with each iteration
resolving things for one blocker only), each blocker bloats. We're
talking about creating duplicates in unique indexes for each and every
iteration, for each and every blocker, and we all know duplicates in
btree indexes are, in a word, bad. I can imagine one or two
ridiculously bloated indexes in this scenario. It's even degenerative
in another direction - the more aggregate bloat we have, the slower
the jump from value to row locking takes, the more likely conflicts
are, the more likely bloat is.
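
To put a rough number on it (my own back-of-the-envelope model, not a
measurement): if, in each iteration, every remaining blocker kills its
speculative tuple and all but the one winner must re-insert, the dead
tuples accumulate quadratically in the number of initial blockers, per
unique index touched.

```c
#include <assert.h>

/*
 * Rough model of the bloat race: n blockers wake up; in each iteration
 * every remaining blocker kills its speculative heap tuple, one wins the
 * row lock, and the losers retry - each loser leaving a dead tuple behind.
 * Total dead tuples = (n-1) + (n-2) + ... + 1 = n(n-1)/2.
 */
static int
dead_tuples_worst_case(int n_blockers)
{
    int     dead = 0;

    for (int remaining = n_blockers; remaining > 1; remaining--)
        dead += remaining - 1;  /* this iteration's losers all bloat */
    return dead;
}
```

So 10 concurrent sessions stuck behind one long-running xact can leave
around 45 dead tuples per affected index from a single contention
episode, and 100 sessions around 4950 - which is what I mean by
"ridiculously bloated".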

Contrast this with my design, where re-ordering of would-be
conflicters across unique indexes (or serialization failures) can
totally nip this in the bud *if* the contention can be re-ordered
around, but if not, at least there is no need to worry about
aggregating bloat at all, because it creates no bloat.

Now, you're probably thinking "but I said I'll reverify the row for
conflicts across versions, and it'll be fine - there's generally no
need to iterate and bloat again provided no unique-indexed column
changed, even if iterating is more likely to occur because of the
clean-up that precedes row locking". Maybe I'm being unfair, but apart
from requiring a
considerable amount of additional infrastructure of its own (a new
EvalPlanQual()-like thing that cares about binary equality in respect
of some columns only across row versions), I think that this is likely
to turn out to be subtly flawed in some way, simply because of the
modularity violation, so I haven't given you the benefit of the doubt
about your ability to frequently avoid repeatedly asking the index +
btree code what to do. For example, consider partial unique indexes -
something that looked okay before, simply because you didn't have
cause to insert into that unique index at all, has to be reconsidered
in light of the fact that the row changed across versions. Are you
going to stash that knowledge too, and is having to assume the worst
there likely to badly affect someone who might otherwise not have
these issues? Do you want to do a value verification thing for that
too, as we do when deciding to insert into partial indexes in the
first place?
Even if this new nothing-changed-across-versions infrastructure works,
will it work often enough in practice to be worth it -- have you ever
tracked the proportion of updates that were HOT updates in a
production DB? It isn't uncommon for it to not be great, and I think
that we can take that as a proxy for how well this will work. It could
be totally legitimate for the UPDATE portion to alter a unique indexed
column all the time.

> Basically, if there's a way we can do this without changing the
> on-disk format (even in a backward-compatible way), I'd be strongly
> inclined to go that route unless we have a really compelling reason to
> believe it's going to suck (or be outright impossible).

I don't believe that anything that I have proposed needs to break our
on-disk format - I hadn't considered what the implications might be in
this area for other proposals, but it's possible that that's an
additional advantage of doing value locking all in-memory.

[1] http://www.postgresql.org/message-id/CAM3SWZRV0F-DjgpXu-WxGoG9eEcLawNrEiO5+3UKRp2e5s=TSg@mail.gmail.com

[2] http://www.postgresql.org/message-id/CAM3SWZQUUuYYcGksVytmcGqACVMkf1ui1uvfJekM15YkWZpzhw@mail.gmail.com
-- 
Peter Geoghegan


