INSERT ... ON CONFLICT {UPDATE | IGNORE}
From: Peter Geoghegan
Subject: INSERT ... ON CONFLICT {UPDATE | IGNORE}
Date:
Msg-id: CAM3SWZTEODEJLz82LK4eF2HYX+qEKrbc8-Vtq3_-aOf6kRSfiA@mail.gmail.com
Responses:
  Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} (Andreas Karlsson <andreas@proxel.se>)
  Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} (Peter Geoghegan <pg@heroku.com>)
  Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} (Robert Haas <robertmhaas@gmail.com>)
  Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} (Peter Geoghegan <pg@heroku.com>)
  Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} (Simon Riggs <simon@2ndquadrant.com>)
  Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
Attached WIP patch extends the INSERT statement, adding a new ON CONFLICT {UPDATE | IGNORE} clause. This allows INSERT statements to perform UPSERT operations (if you want a more formal definition of UPSERT, I refer you to my pgCon talk's slides [1], or the thread in which I delineated the differences between SQL MERGE and UPSERT [2]). The patch builds on previous work in this area, and incorporates feedback from Kevin and Andres.

Overview
========

Example usage:

INSERT INTO upsert(key, val) VALUES(1, 'insert')
ON CONFLICT UPDATE SET val = 'update';

Essentially, the implementation has all stages of query processing track some auxiliary UPDATE state. So, for example, during parse analysis, UPDATE transformation occurs in an ad-hoc fashion tightly driven by the parent INSERT, but using the existing infrastructure (i.e. transformStmt()/transformUpdateStmt() is called, and is insulated from having to care about the feature as a special case). There are some restrictions on what this auxiliary update may do, but FWIW there are considerably fewer than those that the equivalent MySQL or SQLite features impose on their users. All of the following SQL queries are valid with the patch applied:

-- Nesting within wCTE:
WITH t AS (
  INSERT INTO z SELECT i, 'insert' FROM generate_series(0, 16) i
  ON CONFLICT UPDATE SET v = v || 'update' -- use of operators/functions in targetlist
  RETURNING * -- only projects inserted tuples, never updated tuples
)
SELECT * FROM t JOIN y ON t.k = y.a ORDER BY a, k;

-- IGNORE variant:
INSERT INTO upsert(key, val) VALUES(1, 'insert')
ON CONFLICT IGNORE;

-- Predicate within UPDATE auxiliary statement (row is still locked
-- when the UPDATE predicate isn't satisfied):
INSERT INTO upsert(key, val) VALUES(1, 'insert')
ON CONFLICT UPDATE WHERE val != 'delete';

As with SQL MERGE (at least as implemented in other systems), subqueries may not appear within the UPDATE's targetlist, nor may they appear within the special WHERE clause.
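Since the clause's syntax resembles SQLite's conflict clause (see [3], discussed below), the two behaviors can be loosely approximated with Python's sqlite3 module. This is only an illustrative sketch, not the patch's semantics: notably, SQLite has no in-place ON CONFLICT UPDATE here, so the UPDATE variant is emulated with a manual fallback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE upsert (key INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO upsert(key, val) VALUES (1, 'insert')")

# IGNORE variant: the would-be duplicate key violation is silently skipped,
# and the existing row is left untouched.
conn.execute("INSERT OR IGNORE INTO upsert(key, val) VALUES (1, 'ignored')")
assert conn.execute("SELECT val FROM upsert WHERE key = 1").fetchone()[0] == "insert"

# UPDATE variant, emulated: on a duplicate violation, fall back to an UPDATE.
try:
    conn.execute("INSERT INTO upsert(key, val) VALUES (1, 'insert')")
except sqlite3.IntegrityError:
    conn.execute("UPDATE upsert SET val = 'update' WHERE key = 1")
assert conn.execute("SELECT val FROM upsert WHERE key = 1").fetchone()[0] == "update"
```

Note that this emulation, unlike the patch, is racy under concurrency; that gap is exactly what the "value locking" discussion below is about.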
But the "INSERT part" of the query has no additional limitations, so you may for example put subqueries within a VALUES() clause, or write INSERT ... SELECT ... ON CONFLICT UPDATE ... just as you'd expect. INSERT has been augmented with a new clause, but that clause does not unreasonably fail to play nice with any other aspect of insertion. (Actually, that isn't quite true, since at least for now table inheritance, updatable views and foreign tables are unsupported. This can be revisited.)

I think that without initially realizing it, I copied the SQLite syntax [3]. However, unlike with that SQLite feature, CONFLICT only refers to a would-be duplicate violation, and not a violation of any other kind of constraint.

How this new approach works (Executor and Optimizer stuff)
==========================================================

During the execution of the parent ModifyTable, a special auxiliary subquery (the UPDATE ModifyTable) is considered as a special case. This is not a subplan of the ModifyTable node in the conventional sense, and so does not appear within EXPLAIN output. However, it is more or less independently planned, and entirely driven by the INSERT ModifyTable. ExecModifyTable() is never called with this special auxiliary plan state passed directly. Rather, its parent manages the process as the need arises.

ExecLockUpdateTuple() locks and potentially updates tuples, using the EvalPlanQual() mechanism (even at higher isolation levels, with appropriate precautions). The per-tuple expression context of the auxiliary query/plan is used with EvalPlanQual() from within ExecLockUpdateTuple() (the new routine tasked with locking and updating on conflict). There is a new ExecUpdate() call site within ExecLockUpdateTuple(). Given the restrictions necessarily imposed on this pseudo-rescanning (principally the outright rejection of anything that necessitates PARAM_EXEC parameters during planning), this is safe, as far as I'm aware.
It is convenient to be able to re-use infrastructure in such a way as to more or less handle the UPDATE independently, driven by the INSERT, except for execution, which is more directly handled by the INSERT (i.e. there is no ExecModifyTable() call in respect of this new auxiliary ModifyTable plan). Granted, it is kind of bizarre that the auxiliary query may have a more complex plan than is necessary for our purposes, but it doesn't actually appear to be a problem when "rescanning" (like a SELECT FOR UPDATE/FOR SHARE node, we call EvalPlanQualSetTuple() directly).

It is likely worthwhile to teach the optimizer that we really don't care about how the one and only base rel within the UPDATE auxiliary subquery (the target table) is scanned, if only to save a few cycles. I have (temporarily) hacked the optimizer to prevent index-only scans, which are problematic here, by adding disable_cost when a query parse tree that uses the feature is seen. Although what I've done is a temporary kludge, the basic idea of forcing a particular type of relation scan has a precedent: UPDATE WHERE CURRENT OF artificially forces a TID scan, because only a TID scan will work correctly there. I couldn't come up with a convenient way to artificially inject disable_cost into alternative scan types, in the less invasive style of isCurrentOf, because there is no convenient qual to target within cost_qual_eval().

As in previous incarnations, we lock each tuple (although, of course, only with the UPDATE variant). We may or may not also actually proceed with the update, depending on whether or not the user-specified special update predicate (if any) is satisfied.
But if we do, EvalPlanQual() is (once the tuple is locked) only ever evaluated on a conclusively committed and locked-by-us conflict tuple as part of the process of updating, even though it's possible for the UPDATE predicate to be satisfied where conceivably it would not be satisfied by the tuple version actually visible to the command's MVCC snapshot.

I think this is the correct behavior. We all seem to be in agreement that we should update at READ COMMITTED if *no* version of the tuple is visible. It seems utterly arbitrary to me to suggest that on the one hand it's okay to introduce one particular "MVCC violation", but not another equivalent one. The first scenario is one in which we update despite our update's (or rather insert's) "predicate" not being satisfied (according to our MVCC snapshot). The second scenario is one in which the same "predicate" is also not satisfied according to our MVCC snapshot, but in a slightly different way. Why bother introducing a complicated distinction, if it's a distinction without a difference? I'd rather have a behavior that is consistent, easy to reason about, and easy to explain. And so, the predicate is considered once, after conclusively locking a conflict tuple.

It feels natural and appropriate to me that if the special UPDATE qual isn't satisfied, we still lock the tuple. After all, in order to make a conclusive determination about the qual not being satisfied, we need to lock the tuple. This happens to insulate ExecUpdate() from having to care about "invisible tuples", which are now possible (although we still throw an error, just with a useful error message that phrases the problem in reference to this new feature). Of course, at higher isolation levels serialization errors are thrown when something inconsistent with the higher level's guarantees would otherwise need to occur (even for the IGNORE variant).
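The ordering argued for above can be sketched in a few lines of Python. Everything here is invented for illustration (the real logic lives in ExecLockUpdateTuple() and EvalPlanQual()): the point is only that the tuple is locked first, and the auxiliary UPDATE's WHERE clause is then evaluated against the latest committed version, never against the possibly older version visible to the command's MVCC snapshot, and the row stays locked even when the qual fails.

```python
def lock_then_maybe_update(latest_committed, snapshot_version, qual, new_vals):
    """Hypothetical model of the conflict path at READ COMMITTED."""
    locked = dict(latest_committed)  # "lock" the current tuple version first
    # NB: snapshot_version is deliberately ignored when testing the qual;
    # the predicate is considered once, against the locked version.
    if qual(locked):
        locked.update(new_vals)      # proceed with the auxiliary UPDATE
        return locked, True          # (row, updated?)
    return locked, False             # qual failed: row locked, but not updated

# Our snapshot saw val = 'delete', but the latest committed version has
# val = 'insert', so the predicate (val != 'delete') passes and we update.
row, updated = lock_then_maybe_update(
    {"key": 1, "val": "insert"},   # latest committed version
    {"key": 1, "val": "delete"},   # version visible to our MVCC snapshot
    lambda r: r["val"] != "delete",
    {"val": "update"},
)
```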
Still, interactions with SSI, and preserving the guarantees of SSI, should probably be closely considered by a subject matter expert.

Omission
========

The patch currently lacks a way of referencing datums rejected for insertion when updating. The way MySQL handles the issue seems questionable. They allow you to do something like this:

INSERT INTO upsert (key, val) VALUES (1, 'val')
ON DUPLICATE KEY UPDATE val = VALUES(val);

The implication is that the updated value comes from the INSERT's VALUES() list, but emulating that seems like a bad idea. In general, at least with Postgres it's entirely possible that the values rejected differ from the values appearing in the VALUES() list, due to the effects of before triggers. I'm not sure whether or not we should assume equivalent transformations during any UPDATE before triggers. This is an open item. I think it makes sense to deal with it a bit later.

"Value locking"
===============

To date, on-list discussion around UPSERT has almost exclusively concerned what I've called "value locking": the idea of locking values in unique indexes in the abstract (to establish the right to insert ahead of time). There was some useful discussion on this question between myself and Heikki back around December/January. Ultimately, we were unable to reach agreement on an approach, and discussion tapered off. However, Heikki did understand the concerns that informed my design. He recognized the need to be able to easily *release* value locks, so as to avoid "unprincipled deadlocks", where under high concurrency there are deadlocks between sessions that only UPSERT a single row at a time. I'm not sure how widely appreciated this point is, but I believe that Heikki appreciates it. It is a very important point in my opinion. I don't want an implementation that is in any way inferior to the "UPSERT looping subxact" pattern (i.e. the plpgsql thing that the docs suggest).
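For readers unfamiliar with it, the "UPSERT looping subxact" pattern referenced above is the retry loop the Postgres docs suggest writing in plpgsql. A rough rendering of the same control flow in Python over sqlite3 (an illustration of the pattern, not the plpgsql code itself): try the UPDATE, fall back to INSERT if no row was affected, and if a concurrent inserter wins the race, the unique violation sends us around the loop again.

```python
import sqlite3

def upsert(conn, key, val):
    """Looping upsert: keep retrying until either branch succeeds."""
    while True:
        cur = conn.execute("UPDATE upsert SET val = ? WHERE key = ?", (val, key))
        if cur.rowcount > 0:
            return "updated"
        try:
            conn.execute("INSERT INTO upsert(key, val) VALUES (?, ?)", (key, val))
            return "inserted"
        except sqlite3.IntegrityError:
            continue  # lost a race with a concurrent inserter; retry

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE upsert (key INTEGER PRIMARY KEY, val TEXT)")
assert upsert(conn, 1, "insert") == "inserted"
assert upsert(conn, 1, "update") == "updated"
```

In the plpgsql original, each iteration's INSERT runs in a subtransaction, and aborting the subxact is what releases the locks on failure; that implicit release is exactly what the patch's value-locking scheme has to reproduce by other means.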
When we left off, Heikki continued to favor an approach that involved speculatively inserting heap tuples, and then deleting them in the event of a conflict. This design was made more complicated when the need to *release* value locks became apparent (Heikki ended up making some changes to HeapTupleSatisfiesDirty(), as well as sketching a design for what you might call a "super delete", where xmin can be set to InvalidTransactionId for speculatively-inserted heap tuples). After all, it wasn't as if we could abort a subxact to release locks, which is what the "UPSERT looping subxact" pattern does. I think it's fair to say that that design became more complicated than initially anticipated [4] [5].

Anyway, the greater point here is that fundamentally, AFAICT Heikki and I were in agreement. Once you buy into the idea that we must avoid holding on to "value locks" of whatever form - as Heikki evidently did - then exactly what form they take is ultimately only a detail. Granted, it's a very important detail, but a detail nonetheless. It can be discussed entirely independently of all of this new stuff, and thank goodness for that. If anyone finds my (virtually unchanged) page heavyweight lock based value locking approach objectionable, I ask that the criticism be framed in a way that makes a sharp distinction between each of the following:

1. You don't accept that value locks must be easily released in the event of a conflict. Is anyone in this camp? It's far from obvious to me what side of this question Andres is on at this stage, for example. Robert might have something to say here too.

2.
Having taken into account the experience of myself and Heikki, and all that is implied by taking that approach ***while avoiding unprincipled deadlocks***, you continue to believe that an approach based on speculative heap insertion, or some alternative scheme, is better than what I have done to the nbtree code here, or you otherwise dislike something about the proposed value locking scheme. You accept that value locks must be released, and released easily, in the event of a conflict, but like Heikki you just don't like what I've done to get there.

Since we can (I believe) talk about the value locking aspect and the rest of the patch independently, we should do so...unless you're in camp 1, in which case I guess that we'll have to thrash it out.

Syntax, footguns
================

As I mentioned, I have incorporated feedback from Kevin Grittner. You may specify a unique index to merge on from within the INSERT statement, thus avoiding the risk of inadvertently having the update affect the wrong tuple because the user failed to consider that there was a would-be unique violation within some other unique index, constraining some other attribute. You may write the DML statement like this:

INSERT INTO upsert(key, val) VALUES(1, 'insert')
ON CONFLICT WITHIN upsert_pkey UPDATE SET val = 'update';

I think that there is a good chance that at least some people will want to make this mandatory. I guess that's fair enough, but I *really* don't want to *mandate* that users specify the name of their unique index in DML, for obvious reasons. Perhaps we can come up with a more tasteful syntax that covers all interesting cases (consider the issues with partial unique indexes and before triggers, for example, where a conclusion reached about which index to use during parse analysis may subsequently be invalidated by user-defined code, or ambiguous specifications in the face of overlapping attributes between two unique composite indexes, etc.).
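The ambiguity with overlapping composite unique indexes can be made concrete with a toy example. The helper and index names below are invented for illustration, not patch code: given the set of columns an INSERT supplies, more than one unique index may be fully covered, in which case nothing in the statement itself singles out an arbiter index.

```python
def candidate_unique_indexes(inserted_cols, unique_indexes):
    """Return every unique index whose columns are all supplied by the INSERT."""
    cols = set(inserted_cols)
    return sorted(name for name, idx_cols in unique_indexes.items()
                  if set(idx_cols) <= cols)

# Hypothetical table with two overlapping composite unique indexes:
unique_indexes = {
    "t_a_b_key": ("a", "b"),
    "t_b_c_key": ("b", "c"),
}

# An INSERT supplying a, b and c covers both indexes, so inference alone
# cannot choose the arbiter -- one argument for ON CONFLICT WITHIN.
matches = candidate_unique_indexes(["a", "b", "c"], unique_indexes)
```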
The Right Thing is far from obvious, and there is very little to garner from other systems, since SQL MERGE promises essentially nothing about concurrency, both as specified by the standard and in practice. You don't need a unique index at all, and as I showed in my pgCon talk, there are race conditions even for a trivial UPSERT operation in all major SQL MERGE implementations. Note that making merging on one particular unique index mandatory (via syntax) buys the implementation no useful leeway. Just for example, the unprincipled deadlocks test case that illustrated the problem with early "promise tuple" style approaches to value locking [6] involved only a single unique index. AFAICT, the question of whether or not this should be mandatory is just a detail of the feature's high level design, as opposed to something expected to significantly influence the implementation.

Testing, performance
====================

As you'd expect, I've included both isolation tests and regression tests covering a reasonable variety of cases. In addition, stress testing is an important part of my testing strategy. Reviewers are encouraged to try out these test bash scripts:

https://github.com/petergeoghegan/upsert

(Interested hackers should request collaborator status on that Github project from me privately. I welcome new, interesting test cases.)

The performance of the patch seems quite good, and is something that these stress-testing bash scripts also test. Upserts are compared against "equivalent" inserts when we know we'll never update, and against "equivalent" updates when we know we'll never insert. On an 8 core test server, I can sustain ~90,000 ordinary insert transactions per second on an unlogged table defined as follows:

create unlogged table foo
(
  merge serial primary key,
  b int4,
  c text
);

In all cases pgbench uses 8 clients (1 per CPU core). With "equivalent" upserts, it's about ~66,000 TPS.
But this is a particularly unsympathetic case, because I've deliberately exaggerated the effects of heavyweight lock contention on leaf pages by using a serial primary key. Plus, there's the additional planning and parsing overhead. When comparing updating with upserting, it's a similar story. 100,000 tuples are pre-inserted in each case. I can sustain ~98,000 TPS with plain updates, or ~70,000 TPS with "equivalent" upserts. B-Tree index page heavyweight lock contention probably explains some of the difference between "UPSERT inserts" and "UPSERT updates".

Interlocking with VACUUM, race conditions
=========================================

In previous revisions, when we went to lock + update a tuple, no "value locks" were held, and neither were any B-Tree page buffer pins, because they were both released at the same time (recall that I call my heavyweight lock on B-Tree leaf pages a value lock). We still do that (unprincipled deadlocks are our only alternative), but now hold on to the pin for longer, until after tuple locking. Old versions of this patch used to sit on the B-Tree buffer pin to prevent concurrent deletion only as long as value locks were held, but maybe it isn't good enough to sit on the pin only until before we lock/update, as value locks are released: dropping the pin implies that the heap tuple can physically go away, and in general the same TID may then contain anything.

We may have to interlock against VACUUM by sitting on the B-Tree buffer pin (but not the value lock) throughout locking + update. That makes it impossible for the heap tuple slot to fail to relate to the tuple from the B-Tree that is under consideration for locking/updating. Recall that we aren't quite dealing with MVCC semantics here, since in READ COMMITTED mode we can lock a conclusively committed + visible tuple with *no* version visible to our command's MVCC snapshot.
Therefore, it seems worth considering the possibility that the nbtree README's observations on the necessity of holding a pin to interlock against VACUUM (for non-MVCC snapshots) apply. In this revision we have two callbacks (or two calls to the same callback, with different effects): one to release value locks early, to avoid unprincipled deadlocks, and a second to finally release the last unneeded buffer pin.

Recall that when we find a conflict (within _bt_checkunique()), it must be conclusively committed and visible to new MVCC snapshots; we know at that juncture that it's live. The concern is that it might be deleted *and* garbage collected in the interim between finding the conflict tuple and locking it (in practice this interim period is only an instant). This is probably too paranoid, though: the fact that the upserter's transaction is running ought to imply that GetOldestXmin() returns an XID sufficient to prevent this. OTOH, I'm not sure that there exists anything that looks like a precedent for relying on blocking VACUUM in this manner, and it might turn out to be limiting to rely on this.

And, I hasten to add, my fix (sitting on a B-Tree pin throughout row locking) is in another way perhaps not paranoid enough: who is to say that our conflicting value is on the same B-Tree leaf page as our value lock? It might not be, since _bt_checkunique() looks at later B-Tree pages (the value locked page is merely "the first leaf page the value could be on"). Pinning the heavyweight lock page's buffer is certainly justified by the need for non-speculative inserters to see a flag that obligates them to acquire the heavyweight page lock themselves (see comments in patch for more), but this other reason is kind of dubious. In other words: I'm relying on the way VACUUM actually works to prevent premature garbage collection.
It's possible to imagine a world in which HeapTupleSatisfiesVacuum() is smart enough to realize that the tuple UPSERT wants to lock is not visible to anyone (assuming MVCC semantics, etc), and never can be. I've tentatively added code to keep a buffer pin for longer, but that's probably not good enough if we assume that it's necessary at all. Basically, I want to be comfortable about my rationale for it being okay that a "non-MVCC" "index scan" doesn't hold a pin, but right now I'm not. I was conflicted on whether or not I should include the "unpin later" logic at all; for now I've left it in, if only as a placeholder.

Needless to say, if there is a race condition you can take it that it's very difficult to isolate. FWIW, somewhat extensive stress-testing has revealed no bugs that you might associate with these problems, with and without extended buffer pinning, and with artificial random sleeps added at key points in an effort to make any race condition bugs manifest themselves. I have made a concerted effort to break the patch in that way, and I'm now running out of ideas. Running the stress tests (with random delays at key points in the code) for several days reveals no bugs. This is on the same dedicated 8 core server, with plenty of concurrency. It's probably a good idea to begin using my B-Tree verification tool [7] for testing...on the other hand, it doesn't know anything about MVCC, and will only detect the violation of invariants that are localized to the B-Tree code, at least at the moment.

Open items
==========

I already mentioned the inability to reference rejected rows in an UPDATE, as well as my unease about VACUUM interlocking, both of which are open items. Also, some of the restrictions that I already mentioned - on updatable views, inheritance, and foreign tables - are probably unnecessary. We should be able to come up with reasonable behavior for at least some of those.
Patch
=====

I thought that I went too long without posting something about all of this to the list to get feedback, and so I decided to post this WIP patch set. I've tried to break it up into pieces, but it isn't all that suitable for representing as cumulative commits. I've also tried to break up the discussion usefully (the question of how everything fits together at a high level can hopefully be discussed separately from the question of how "value locks" are actually implemented).

Thoughts?

[1] http://www.pgcon.org/2014/schedule/attachments/327_upsert_weird.pdf ("Goals for UPSERT in Postgres")
[2] http://www.postgresql.org/message-id/CAM3SWZRP0c3g6+aJ=YYDGYAcTZg0xA8-1_FCVo5Xm7hrEL34kw@mail.gmail.com
[3] https://sqlite.org/lang_conflict.html
[4] http://www.postgresql.org/message-id/CAM3SWZQoArVQGMi=v-jk3sBjsPg+wdjeUkM_6L5TZG_i9pyGzQ@mail.gmail.com
[5] http://www.postgresql.org/message-id/52B4AAF0.5090806@vmware.com
[6] http://www.postgresql.org/message-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com
[7] http://www.postgresql.org/message-id/CAM3SWZRtV+xmRWLWq6c-x7czvwavFdwFi4St1zz4dDgFH4yN4g@mail.gmail.com

--
Peter Geoghegan