Re: INSERT...ON DUPLICATE KEY IGNORE

From:        Peter Geoghegan
Subject:     Re: INSERT...ON DUPLICATE KEY IGNORE
Date:
Msg-id:      CAM3SWZRUfCaB1L8mxXki2hqS+AnJz2=P2LpHSUoUGxMRCTt5UQ@mail.gmail.com
In reply to: Re: INSERT...ON DUPLICATE KEY IGNORE (Andres Freund <andres@2ndquadrant.com>)
Responses:   Re: INSERT...ON DUPLICATE KEY IGNORE (Peter Geoghegan <pg@heroku.com>)
             Re: INSERT...ON DUPLICATE KEY IGNORE (Andres Freund <andres@2ndquadrant.com>)
List:        pgsql-hackers
On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Won't S2 in this case exclusively lock a page in foo_a_key (the one
>> > where 2 will reside on) for 3600s, while it's waiting for S1 to finish
>> > while getting the speculative insertion into foo_b_key?
>>
>> Yes. The way the patch currently handles that case is unacceptable. It
>> needs to be more concerned about holding an exclusive lock on
>> earlier-locked indexes when locking the second and subsequent indexes
>> iff it blocks on another transaction in the course of locking the
>> second and subsequent indexes.

I think that the fix here may be to do something analogous to what the
code already does: release all buffer locks, wait for the other
transaction to commit/abort, and try again. The difference is that we'd
try again across all unique indexes rather than just this one. I have
other obligations to take care of over the next couple of days, so I
don't really have time to code that up and satisfy myself that it's
robust; I'd prefer to have something more concrete to back this thought
up, but that's my immediate reaction, for what it's worth. We're looking
for the first duplicate, so it would probably be okay for the IGNORE
case not to bother retrying and re-locking if the other transaction
committed (which, at least very broadly speaking, is the likely
outcome).

> After some thinking I don't think any solution primarily based on
> holding page level locks across other index operations is going to scale
> ok.

Just to be clear: I'm certainly not suggesting that it's okay to hold
exclusive buffer locks for any amount of time longer than instantaneous.
Furthermore, it isn't necessary to do anything different from what we
currently do with non-unique btree indexes during speculative insertion
- I don't treat them any differently, and they only get locked in the
usual way at the usual time, after heap insertion.
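The release-everything-and-retry idea above can be sketched as a toy
simulation - this is not the patch's actual C code, and all names
(UniqueIndex, lock_all, etc.) are illustrative only:

```python
# Toy model of the proposed protocol: on a conflict in any unique index,
# release every value lock already taken, wait for the blocking
# transaction, then retry across ALL unique indexes from scratch.

class UniqueIndex:
    def __init__(self, name):
        self.name = name
        self.locked_values = {}            # value -> owning xid

    def try_lock(self, value, xid):
        """Return None on success, or the blocking xid on conflict."""
        owner = self.locked_values.get(value)
        if owner is not None and owner != xid:
            return owner
        self.locked_values[value] = xid
        return None

    def unlock(self, value, xid):
        if self.locked_values.get(value) == xid:
            del self.locked_values[value]

def lock_all(indexes, values, xid, wait_for):
    """Value-lock every unique index before heap insertion; never wait
    on another transaction while still holding any lock."""
    while True:
        held = []
        blocker = None
        for idx, val in zip(indexes, values):
            blocker = idx.try_lock(val, xid)
            if blocker is not None:
                break
            held.append((idx, val))
        if blocker is None:
            return True                    # all value locks acquired
        for idx, val in held:              # release earlier locks first,
            idx.unlock(val, xid)           # then block on the other xact
        wait_for(blocker)
```

With IGNORE semantics, `wait_for` could instead just report the conflict
and skip the retry when the blocker committed, per the reasoning above.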
Much of the time, there'll only be a primary key index to lock, and
you'll have increased the window of the exclusive write-lock (compared
to the existing code when doing a regular insertion) by only a very
small amount - the duration of heap tuple insertion (the baseline here
is the duration of finding an insertion location on the page for the
index tuple, plus tuple insertion and a possible page split). We're
talking about a write-lock on a btree leaf page, which is only going to
block other sessions from *finishing off* tuple insertion, and then only
for a minority of such insertions much of the time.

It's probably worth doing as much work up-front as possible (in terms of
asking questions about whether we even want to lock the indexes and so
on, within ExecLockIndexTuples()), so as to decrease the window during
which those locks are held. That's just an obvious optimization.

As users add unique indexes to tables, things do of course get worse and
worse. However, since for any of this to matter there has to be lots of
concurrent activity in clustered ranges of values, it's pretty likely
that you'd conclude that tuple insertion should not go ahead early on,
and so not end up doing too much exclusive/write locking per tuple.

The question must be asked: if you don't think this will scale okay,
what are you comparing it to? The most useful basis of comparison is
likely to be other systems. Asking about uniqueness across many unique
indexes, and getting them all to commit to a unanimous position for a
moment in time, is inherently involved - I don't see a way to do it
other than locking values, and I don't see any granularity at which
that's possible other than btree page granularity.

> What I had in mind when playing around with this is the following trick:
> Just as in your patch split index insertion into two phases and perform
> one of them before inserting the heap tuple.
> In what you called
> "speculative insertion" don't stop at locking the page, but actually
> insert an index tuple in the index. As we haven't yet inserted the heap
> tuple we obviously can't insert an actual tuple. Instead set a flag on
> the item and reuse the item pointer to store the xid of the
> inserting process. I coined those items "insertion promises"...

I did consider that sort of approach from an early stage - Simon
suggested it to me privately early last year. Actually, I even pointed
out at the 2012 developer meeting that there were unused IndexTuple flag
bits available for such purposes.

> Index searches will ignore promises they see, but concurrent index
> insertions will notice it's a promise and wait for the xid (or not, if
> it has aborted).

If inserters must block, they'll have to restart - I'm sure you'll agree
that they can't sit on even a shared/read buffer lock indefinitely. So
have you really saved anything? Especially since you'll have to search
at least twice (more shared locks on *root* pages) where my scheme only
needs to search once. Also, are you really sure that index searches
should have license to ignore these promises? What about dirty
snapshots?

The question I should really ask about your scheme is the same question
you've asked of mine: what about multiple unique indexes on the same
table? Complexity issues aside, is it really better to get
exclusive/write locks on each of them twice (albeit shorter-held ones),
and approximately double the number of shared locks too, just to avoid
concurrently held index tuple exclusive locks? Now, I guess you probably
could make all this work - somehow - but that's even more complexity and
even more cycles, for vanishingly small gain in one specific area. And
frankly, I don't think there was much gain to begin with. You're
glossing over too many details about what your scheme would need to do
to make a fair comparison with mine.
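For comparison, the "insertion promise" scheme quoted above could be
modeled roughly like this - again a toy Python sketch, not real nbtree
code, with all names (BTreeIndex, fulfil, etc.) invented for
illustration:

```python
# Toy model of "insertion promises": rather than holding a page lock,
# insert a placeholder index entry carrying the inserting xact's xid.
# A later second phase replaces the promise with a real index tuple.

PROMISE = "promise"

class BTreeIndex:
    def __init__(self):
        self.entries = {}              # key -> (PROMISE, xid) or ("tuple", tid)

    def insert(self, key, xid, wait_for_xid):
        """Insert a promise for key; on meeting another xact's promise,
        wait for that xact and restart the insertion from scratch."""
        while True:
            entry = self.entries.get(key)
            if entry is None:
                self.entries[key] = (PROMISE, xid)
                return True
            kind, owner = entry
            if kind == PROMISE:
                if owner == xid:
                    return True        # our own promise, from a restart
                wait_for_xid(owner)    # blocked: retry after it finishes
                continue
            return False               # real duplicate: unique violation

    def fulfil(self, key, xid, tid):
        # Second phase, after heap insertion: update the promise in place,
        # which is the extra WAL-logged write criticized in the reply.
        assert self.entries.get(key) == (PROMISE, xid)
        self.entries[key] = ("tuple", tid)
```

Note that the waiter has to loop back and restart its search, which is
the "search at least twice" cost the reply points out.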
It seems like you're really concerned about scalability in the sense of
scaling the number of unique indexes on one table when there is lots of
concurrent activity on the same range of values across multiple unique
indexes. With my patch and a single primary key unique index, the
overhead of this extended window of locking is very small indeed -
because we usually only walk the tree and lock once, the overhead
relative to regular insertion is fixed. 2 or 3 unique indexes aren't
much worse in practice, I suspect.

> Imo that solution is fairly elegant because it doesn't cause any heap
> bloat, and it causes the same amount of index bloat
> "insert-into-heap-first" type of solutions cause.

I don't think that's a reasonable comparison. Bloating indexes with dead
*duplicate* tuples is, in general, worse than doing the same with the
heap. If you end up with a heavily bloated index with lots of
duplicates, that will adversely affect performance more than
non-duplicate index tuples would (see [1] for an example). That was one
reason HOT helped as much as it did, I believe. More fundamentally, I
don't like the idea that very routine bloat is okay. I've seen people
abuse unique indexes by using them in a way that made it inevitable that
there'd be a constant stream of constraint violations, with attendant
bloat. I don't like that.

> It also has pretty
> good concurrency behaviour because you never need to retain a page lock
> for longer than a normal index insertion.
> It's certainly a bit more complex than your solution tho...

It's a lot more complex. You have to write an index tuple, and WAL-log
that, just to do nothing. You may also need to WAL-log "insertion
proper", which in your scheme involves updating the index in place.

Having said all that, I'm perfectly willing to talk about the
scalability aspects of what I've done when I produce a revision that
fixes that bug.
It would certainly help your argument here if you had a benchmark, since
it is my understanding that your concern is entirely about
performance/scalability.

Thanks for your input.

Unrelated: does the patch need logic for looking to the primary key for
duplicates first? Or maybe, if it's a SERIAL column, to look there last,
since it's unlikely to have a duplicate and it might hurt more to lock a
leaf page there? It doesn't really matter now, but for upsert we're
going to want that. For upsert, which row to update may be ambiguous, so
it makes sense to prioritise indexes for checking if only the first
duplicate found will be updated in respect of a single inserted tuple.
This kind of thing also helps performance.

*** References ***

[1] https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/access/nbtree/nbtinsert.c#L556

--
Peter Geoghegan