Re: INSERT...ON DUPLICATE KEY IGNORE

From: Peter Geoghegan
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date:
Msg-id: CAM3SWZRUfCaB1L8mxXki2hqS+AnJz2=P2LpHSUoUGxMRCTt5UQ@mail.gmail.com
In response to: Re: INSERT...ON DUPLICATE KEY IGNORE  (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: INSERT...ON DUPLICATE KEY IGNORE  (Peter Geoghegan <pg@heroku.com>)
           Re: INSERT...ON DUPLICATE KEY IGNORE  (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Won't S2 in this case exclusively lock a page in foo_a_key (the one
>> > where 2 will reside on) for 3600s, while it's waiting for S1 to finish
>> > while getting the speculative insertion into foo_b_key?
>>
>> Yes. The way the patch currently handles that case is unacceptable. It
>> needs to be more concerned about holding an exclusive lock on
>> earlier-locked indexes when locking the second and subsequent indexes
>> iff it blocks on another transaction in the course of locking the
>> second and subsequent indexes.

I think that the fix here may be to do something analogous to what the
code already does: release all buffer locks, wait for the other
transaction to commit/abort, and try again. The difference being that
we try again across all unique indexes rather than just this one. I
have other obligations to take care of over the next couple of days,
so I don't really have time to code that up and satisfy myself that
it's robust; I'd prefer to have something more concrete to back this
thought up, but that's my immediate reaction for what it's worth.
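
Roughly the shape I have in mind, as a sketch only - the helper names
below are invented for illustration and aren't actual patch code:

TransactionId xwait;
int           i;

retry:
    for (i = 0; i < num_unique_indexes; i++)
    {
        /* exclusive leaf-page lock ("value lock") on this unique index */
        xwait = lock_value_in_unique_index(unique_indexes[i], slot);

        if (TransactionIdIsValid(xwait))
        {
            /* possible conflict - we can't sleep while holding buffer locks */
            release_value_locks_taken_so_far();
            XactLockTableWait(xwait);
            goto retry;         /* try again across *all* unique indexes */
        }
    }
    /* no conflicts anywhere: insert the heap tuple, then finish the indexes */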

We're looking for the first duplicate. So it would probably be okay
for the IGNORE case to not bother retrying and re-locking if the other
transaction committed (which, at least very broadly speaking, is the
likely outcome).

> After some thinking I don't think any solution primarily based on
> holding page level locks across other index operations is going to scale
> ok.

Just to be clear: I'm certainly not suggesting that it's okay to have
exclusive buffer locks held for any amount of time longer than
instantaneous. Furthermore, it isn't necessary to do anything
different from what we currently do with non-unique btree indexes
during speculative insertion - I don't treat them any differently, and
they only get locked in the usual way at the usual time, after heap
insertion. Much of the time there'll only be a primary key index to
lock, and the window of the exclusive write-lock grows (compared to
the existing code doing a regular insertion) by only a very small
amount: the duration of heap tuple insertion. The baseline is the time
it takes to find an insertion location on the leaf page for the index
tuple, insert it, and possibly split the page. We're talking
about a write-lock on a btree leaf page, which is only going to block
other sessions from *finishing off* tuple insertion, and then only for
a minority of such insertions much of the time.
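
To spell the ordering out (the function names here are illustrative
only, not the patch's actual interfaces):

/* per tuple, simplified */
value_lock_unique_leaf_page(pk_index, values);  /* exclusive leaf buffer lock */
/* ...likewise for any other unique indexes... */

heap_insert(heap_rel, heaptup, ...);            /* the only *added* time the
                                                 * leaf lock is held vs. today */

finish_unique_index_insertion(pk_index, ...);   /* insert the index tuple,
                                                 * maybe split, release - the
                                                 * same work as a plain insert */

insert_into_nonunique_indexes(...);             /* locked in the usual way, at
                                                 * the usual time, as today */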

It's probably worth doing as much work up-front as possible within
ExecLockIndexTuples() (asking whether we even want to lock each index,
and so on), so as to shrink the window during which those locks are
held. That's just an obvious optimization.
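
For instance (again, just a sketch with invented helpers), everything
that doesn't need a buffer lock can happen before any value locks are
taken:

/* before taking any value locks at all: */
for (i = 0; i < num_indexes; i++)
{
    if (!index_is_unique(indexes[i]))
        continue;               /* nothing to value-lock here */

    if (index_is_partial(indexes[i]) &&
        !partial_predicate_satisfied(indexes[i], slot))
        continue;               /* this index won't receive the tuple anyway */

    form_index_key_datums(indexes[i], slot);    /* compute key values up front */
}
/* only now descend the btrees and take the exclusive leaf locks */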

As users add unique indexes to tables, things do of course get worse
and worse. However, for any of this to matter there has to be lots of
concurrent activity within clustered ranges of values, in which case
it's pretty likely that we'd conclude early on that tuple insertion
should not go ahead, and so not end up doing much exclusive/write
locking per tuple.

The question must be asked: if you don't think this will scale okay,
what are you comparing it to? The most useful basis of comparison is
likely to be other systems. Asking about uniqueness across many unique
indexes, and getting them all to commit to a unanimous position for a
moment in time, is inherently involved - I don't see a way to do it
other than to lock values, and I don't see any granularity at which
that's possible other than btree page granularity.

> What I had in mind when playing around with this is the following trick:
> Just as in your patch split index insertion into two phases and perform
> one of them before inserting the heap tuple. In what you called
> "speculative insertion" don't stop at locking the page, but actually
> insert an index tuple in the index. As we haven't yet inserted the heap
> tuple we obviously can't insert an actual tuple. Instead set a flag on
> the item and reuse the item pointer to store the xid of the
> inserting process. I coined those items "insertion promises"...

I did consider that sort of approach from an early stage - Simon
suggested it to me privately early last year. Actually, I even pointed
out in the 2012 developer meeting that there were unused IndexTuple
flag bits available for such purposes.

> Index searches will ignore promises they see, but concurrent index
> insertions will notice it's a promise and wait for the xid (or not, if
> it has aborted).

If inserters must block, they'll have to restart - I'm sure you'll
agree that they can't sit on even a shared/read buffer lock
indefinitely. So, have you really saved anything? Especially since
you'll have to search at least twice (more shared locks on *root*
pages), whereas my scheme only needs to search once. Also, are you really
sure that index searches should have license to ignore these promises?
What about dirty snapshots?
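
To make sure we're picturing the same thing, here is the promise
scheme as I read it - the flag and helper names are invented:

/*
 * Phase one: leave an "insertion promise" in the unique index.  There is
 * no heap tuple yet, so t_tid can't point anywhere useful; the idea is to
 * reuse it to carry the inserting backend's xid, plus a spare flag bit.
 */
itup->t_info |= INDEX_PROMISE_BIT;

/* What a concurrent unique inserter has to do on finding one: */
if (index_tuple_is_promise(curitup))
{
    TransactionId xid = promise_xid(curitup);

    _bt_relbuf(rel, buf);       /* can't sleep while holding a buffer lock... */
    XactLockTableWait(xid);     /* ...so wait for the promising xact... */
    /* ...and then redo the descent from the root, shared locks and all */
}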

The question I should really ask about your scheme is the same
question you've asked of mine: What about multiple unique indexes on
the same table? Complexity issues aside, is it really better to get
exclusive/write locks on each of them twice (albeit ones held for a
shorter time), and approximately double the number of shared locks
too, just
to avoid concurrently held index tuple exclusive locks?

Now, I guess you probably could make all this work - somehow - but
that's even more complexity and even more cycles for vanishingly small
gain in one specific area. And frankly I don't think there was much
gain to begin with. You're glossing over too many details about what
your scheme would need to do to make a fair comparison with mine. It
seems like you're really concerned about scalability in the sense of
scaling the number of unique indexes on one table when there is lots
of concurrent activity on the same range of values across multiple
unique indexes. With my patch, and a single primary key unique index,
the overhead of this extended window of locking is very small
indeed - because we usually only walk the tree and lock once, the
overhead relative to regular insertion is fixed. Two or three unique
indexes aren't much worse in practice, I suspect.

> Imo that solution is fairly elegant because it doesn't cause any heap
> bloat, and it causes the same amount of index bloat
> "insert-into-heap-first" type of solutions cause.

I don't think that's a reasonable comparison. Bloating indexes with
dead *duplicate* tuples is, in general, worse than doing the same with
the heap. A heavily bloated index full of duplicates hurts performance
more than the same amount of bloat from non-duplicate index tuples
would (see [1] for example). That was one reason that HOT helped as
much as it did, I believe.

More fundamentally, I don't like the idea that very routine bloat is
okay. I've seen people abuse unique indexes by using them in a way
that made it inevitable that there'd be a constant stream of
constraint violations, with attendant bloat. I don't like that.

> It also has pretty
> good concurrency behaviour because you never need to retain a page lock
> for longer than a normal index insertion.
> It's certainly a bit more complex than your solution tho...

It's a lot more complex. You have to write an index tuple, and WAL-log
that, just to do nothing. You may also need to WAL-log the "insertion
proper", which in your scheme involves updating the index in place.

Having said all that, I'm perfectly willing to talk about the
scalability aspects of what I've done when I produce a revision that
fixes that bug. It would certainly help your argument here if you had
a benchmark, since it is my understanding that your concern is
entirely about performance/scalability.

Thanks for your input.

Unrelated:

Does the patch need logic about looking to the primary key for
duplicates first? Or maybe, if it's a SERIAL column, to look there
last since it's unlikely to have a duplicate and it might hurt more to
lock a leaf page there? It doesn't really matter now, but for upsert
we're going to want that. For upsert, which row to update may be
ambiguous, so it makes sense to prioritise the order in which indexes
are checked if only the first duplicate found for a given inserted
tuple will be updated. This kind of thing also helps performance.
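
One possible shape for that prioritisation, purely by way of
illustration (the helpers below don't exist; nothing like this is in
the patch today):

/* order in which unique indexes get value-locked/checked; lower is sooner */
static int
unique_index_check_priority(Relation indexrel)
{
    if (index_is_primary_key(indexrel))
        return 0;       /* most likely place to find the duplicate we care about */

    if (indexed_column_has_sequence_default(indexrel))
        return 2;       /* duplicates very unlikely, and its rightmost leaf page
                         * is a comparatively expensive thing to write-lock */

    return 1;           /* everything else in between */
}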

*** References ***

[1] https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/access/nbtree/nbtinsert.c#L556

-- 
Peter Geoghegan


