Обсуждение: UPSERT wiki page, and SQL MERGE syntax
Having been surprisingly successful at advancing our understanding of arguments for and against various approaches to "value locking", I decided to try the same thing out elsewhere. I have created a general-purpose UPSERT wiki page. The page is: https://wiki.postgresql.org/wiki/UPSERT Right now, I would like to address the less contentious but still unresolved question of whether or not we should use the SQL-standard MERGE syntax instead of the non-standard INSERT ... ON CONFLICT UPDATE syntax that my patch proposes. Note that I'm trying to direct the conversation along certain lines: Is MERGE the right syntax for this particular effort ("UPSERT")? And, in particular, not: Is SQL MERGE useful? I think that it is useful, but is a largely unrelated feature. Please correct the Wiki page if I have failed to credit SQL MERGE with some advantage that someone stated at some point. I don't recall any arguments for it, other than that it is in the SQL standard, but maybe I missed one. In general, add to this page, and edit it as you see fit. It'll be useful to centralize the references, discussion and state of the patch in one agreed upon place, as the patch continues to evolve. -- Peter Geoghegan
On 10/02/2014 07:52 AM, Peter Geoghegan wrote: > Having been surprisingly successful at advancing our understanding of > arguments for and against various approaches to "value locking", I > decided to try the same thing out elsewhere. I have created a > general-purpose UPSERT wiki page. > > The page is: https://wiki.postgresql.org/wiki/UPSERT Thanks. That'll help keep things moving forward rather than around in circles. > In general, add to this page, and edit it as you see fit. It'll be > useful to centralize the references, discussion and state of the patch > in one agreed upon place, as the patch continues to evolve. I added a summary of the status quo of upsert in Pg as it stands, and a brief discussion of the state in other RDBMSes. I'd love it if someone who knows MySQL better could add info on MySQL's ON DUPLICATE KEY feature - advantages/problems, etc. I've added a few points to the goals section: - Any new upsert approach must be a usability improvement on the status quo; we don't want to introduce subtle behaviour or unnecessary foot-guns. - if possible, upsert of multiple values is desirable. We currently have to loop on a per-value basis. ... and some miscellaneous edits/formatting changes. I've also added sections for the other options: * A custom syntax https://wiki.postgresql.org/wiki/UPSERT#Custom_syntax and * Adopting an existing non-standard syntax https://wiki.postgresql.org/wiki/UPSERT#Adopting_an_existing_non-standard_syntax which I'd appreciate it if you could fill out based on your existing research and notes. I don't think it makes sense for me to write those when you've already done the required study/note taking and just need to transfer it over. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 10/02/2014 02:52 AM, Peter Geoghegan wrote: > Having been surprisingly successful at advancing our understanding of > arguments for and against various approaches to "value locking", I > decided to try the same thing out elsewhere. I have created a > general-purpose UPSERT wiki page. > > The page is: https://wiki.postgresql.org/wiki/UPSERT Thanks! Could you write down all of the discussed syntaxes, using a similar notation we use in the manual, with examples on how to use them? And some examples on what is possible with some syntaxes, and not with others? That would make it a lot easier to compare them. - Heikki
On Thu, Oct 2, 2014 at 12:07 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Could you write down all of the discussed syntaxes, using a similar notation > we use in the manual, with examples on how to use them? And some examples on > what is possible with some syntaxes, and not with others? That would make it > a lot easier to compare them. I've started off by adding varied examples of the use of the existing proposed syntax. I'll expand on this soon. -- Peter Geoghegan
On Thu, Oct 2, 2014 at 12:54 AM, Peter Geoghegan <pg@heroku.com> wrote: > I've started off by adding varied examples of the use of the existing > proposed syntax. I'll expand on this soon. I spent some time today expanding on the details, and commenting on the issues around the custom syntax (exactly what it does, issues around ambiguous conflict-on unique indexes, etc). Also, "Challenges, issues" has been updated - I've added some "secondary concerns". These are non-contentious items that I think are of concern, and would like to highlight now. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> wrote: > The page is: https://wiki.postgresql.org/wiki/UPSERT Thanks! I have two questions I hope you can clarify. I'm having trouble parsing what this statement means: > ... the SQL standard does not require that MERGE be atomic in the > sense of atomically providing either an INSERT or UPDATE, ... My understanding is that the standard logically requires (without concern for implementation details) that the second specified table (via table name or subquery -- which could be a VALUES statement) is scanned, and for each row it attempts to match a row in the target table. That will either match or not, according to the boolean expression in the ON clause. You can have one WHEN MATCHED THEN UPDATE clause and/or one WHEN NOT MATCHED THEN INSERT clause. If both clauses are present, I believe that it is guaranteed that one or the other (but not both) will fire for each row in the second table. The usual guarantees for each isolation level must not be violated (although an implementation is not required to generate anomalies which could not happen with serial execution). So as usual for a transaction involving multiple tables, serialization anomalies are possible if the transaction isolation level is REPEATABLE READ or less. Is that what you're getting at, or something else? Regarding this example: > -- Predicate within UPDATE auxiliary statement > -- (row is still locked when the UPDATE predicate > -- isn't satisfied): > INSERT INTO upsert(key, val) VALUES(1, 'insert') > -- CONFLICTING() carries forward effects of both INSERT and UPDATE BEFORE row-level triggers > ON CONFLICT UPDATE SET val = CONFLICTING(val) > -- Predicate has interesting semantics visibility-wise: > WHERE val != 'delete'; Can you explain what the WHERE clause there does? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 3, 2014 at 1:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > I have two questions I hope you can clarify. I'm having trouble > parsing what this statement means: > >> ... the SQL standard does not require that MERGE be atomic in the >> sense of atomically providing either an INSERT or UPDATE, ... > > My understanding is that the standard logically requires (without > concern for implementation details) that the second specified table > (via table name or subquery -- which could be a VALUES statement) > is scanned, and for each row it attempts to match a row in the > target table. That will either match or not, according to the > boolean expression in the ON clause. You can have one WHEN MATCHED > THEN UPDATE clause and/or one WHEN NOT MATCHED THEN INSERT clause. > If both clauses are present, I believe that it is guaranteed that > one or the other (but not both) will fire for each row in the > second table. The usual guarantees for each isolation level must > not be violated (although an implementation is not required to > generate anomalies which could not happen with serial execution). That seems like a statement that isn't getting to the heart of the matter. Which is: simple UPSERT cases will get things like duplicate key violations under concurrency with MERGE, in SQL Server, Oracle, and DB2. I think serialization failures are inevitable and appropriate when not using READ COMMITTED mode with MVCC, but these duplicate violations are sort of like READ COMMITTED serialization failures in disguise. Or they would be, if MERGE (as described by the standard, or in practice) promised to care about atomicity in that sense. It doesn't, so they're off the hook. I think we should care about atomicity (in the sense of always getting an insert or update, never an error at RC) for the simple reason that that's what people want. It seems like your remarks here don't acknowledge the reality: That there are slightly different ideas of "each row" that are in play here. > So as usual for a transaction involving multiple tables, > serialization anomalies are possible if the transaction isolation > level is REPEATABLE READ or less. Is that what you're getting at, > or something else? My complaint is quite straightforward, really. I don't want to have to tell users to do this: http://stackoverflow.com/a/22777749 At the same time, I also don't want to have to live with the consequences of implementing a MERGE that does not exhibit that behavior. Which is to say, the consequences of doing something like selectively using different types of snapshots (MVCC or dirty - the two different ideas of "each row" that are in tension) based on the exact clauses used. That seems like asking for trouble, TBH. >> -- Predicate within UPDATE auxiliary statement >> -- (row is still locked when the UPDATE predicate >> -- isn't satisfied): >> INSERT INTO upsert(key, val) VALUES(1, 'insert') >> -- CONFLICTING() carries forward effects of both INSERT and UPDATE BEFORE row-level triggers >> ON CONFLICT UPDATE SET val = CONFLICTING(val) >> -- Predicate has interesting semantics visibility-wise: >> WHERE val != 'delete'; > > Can you explain what the WHERE clause there does? Sure. Having already locked the tuple on the basis on finding a conflict TID, but before an UPDATE, the qual is evaluated using EvalPlanQual rescanning. The predicate must be satisfied in order for the UPDATE to actually proceed. If, in this instance, the locked tuple happened to have a val of 'delete', then we would not UPDATE at all. It would still be locked, though (possibly without being visible to our MVCC snapshot, just like when we might do that when we do UPDATE). There are some interesting implications visibility-wise here that must be considered. These are discussed on the Wiki page: https://wiki.postgresql.org/wiki/UPSERT#Visibility_issues_and_the_proposed_syntax_.28WHERE_clause.2Fpredicate_stuff.29 Obviously higher isolation levels just throw an error here instead. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Oct 3, 2014 at 1:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >> I'm having trouble parsing what this statement means: >> >>> ... the SQL standard does not require that MERGE be atomic in >>> the sense of atomically providing either an INSERT or UPDATE, ... > ... always getting an insert or update, never an error ... I've never seen "atomic" used to mean "you never get an error" before. Perhaps it would be clearer to replace "atomic" with "error-free" or some such. It certainly would be less confusing for me. "Atomic" in most cases is a property that can be enforced by generating an error where it would otherwise be violated. For example: http://en.wikipedia.org/wiki/Atomicity_%28database_systems%29 "Although implementations vary depending on factors such as concurrency issues, the principle of atomicity — i.e. complete success or complete failure — remain." When you are saying "atomic" you mean something quite different. > My complaint is quite straightforward, really. I don't want to > have to tell users to do this: http://stackoverflow.com/a/22777749 I think you are confusing syntax with semantics. I grant that a reasonable person could have concerns about using the MERGE syntax to implement the semantics you want in the special case that an appropriate unique index exists, but pretending that the semantics can't be achieved with that syntax is just silly. > At the same time, I also don't want to have to live with the > consequences of implementing a MERGE that does not exhibit that > behavior. Which is to say, the consequences of doing something > like selectively using different types of snapshots (MVCC or > dirty - the two different ideas of "each row" that are in > tension) based on the exact clauses used. That seems like asking > for trouble, TBH. Now *that* is getting more to a real issue. We routinely pick very different plans based on the presence or absence of an index, and we use special snapshots in the course of executing many DML statements (if FK triggers are fired), but this would be the first place I can think of that the primary DML statement behaves that way. You and a couple others have expressed concern about that, but it seems pretty vague and hand-wavey. If a different execution node is used for the special behavior, and that node is generated based on the unique index being used, I'm really having problems seeing where the problem lies. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 3, 2014 at 2:44 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > I've never seen "atomic" used to mean "you never get an error" > before. > When you are saying "atomic" you mean something quite different. Perhaps I should have been more careful on that point. Just to be crystal clear: I don't intend that you'll never get an error (some cases clearly *should* error). But I want confidence that reasonable, simple UPSERTs will never error due to a duplicate violation that they thought the statement was supposed to avoid. That's the point of UPSERT. Guarantees matter. I would like to make guarantees that are at least as good as the upsert looping subxact pattern. Teradata refers to an "atomic UPSERT". It also has a MERGE command. >> My complaint is quite straightforward, really. I don't want to >> have to tell users to do this: http://stackoverflow.com/a/22777749 > > I think you are confusing syntax with semantics. I grant that a > reasonable person could have concerns about using the MERGE syntax > to implement the semantics you want in the special case that an > appropriate unique index exists, but pretending that the semantics > can't be achieved with that syntax is just silly. I am not pretending that it can't be done - just, as I said, that to do so would have bad consequences when time came to generalize the syntax to support more advanced cases. >> At the same time, I also don't want to have to live with the >> consequences of implementing a MERGE that does not exhibit that >> behavior. Which is to say, the consequences of doing something >> like selectively using different types of snapshots (MVCC or >> dirty - the two different ideas of "each row" that are in >> tension) based on the exact clauses used. That seems like asking >> for trouble, TBH. > > Now *that* is getting more to a real issue. Yes, it is. I am opposed to using the MERGE syntax for this *because* MERGE is useful (independently useful, when done properly), not because it is not useful. > We routinely pick very > different plans based on the presence or absence of an index, and > we use special snapshots in the course of executing many DML > statements (if FK triggers are fired), but this would be the first > place I can think of that the primary DML statement behaves that > way. You and a couple others have expressed concern about that, > but it seems pretty vague and hand-wavey. If a different execution > node is used for the special behavior, and that node is generated > based on the unique index being used, I'm really having problems > seeing where the problem lies. We need to avoid duplicate violations. The authoritative source of truth here is a unique index, because the presence or absence of a conclusively-visible conflict tuple controls whether our insert fails or succeeds respectively. We need a dirty snapshot to see tuples not in our MVCC snapshot, since they need to be handled too. Otherwise, when we fail to handle them, the MERGE's INSERT has a dup violation (which is not what I want, for practical reasons). If we're seq scanning a table, even with a dirty snapshot, there is no authoritative source of truth. Heap tuples will be inserted concurrently, and we'll have no idea what to do about it without reference to a unique index as a choke point/single source of final truth about what is and isn't going to be duplicated. As implemented, my UPSERT patch will have UPSERTs not really use their MVCC snapshot for simple cases, just like INSERTs in general. UPDATEs always use their MVCC snapshot, as the ModifyTable node pulls up tuples from an ordinary relation scan. UPSERT needs a DirtySnapshot scan of the unique index to find duplicates, because that's always how we find duplicates. At the same time, MERGE is not at liberty to not work just because of the absence of a unique index (plus even row locking must be prepared for conflicts). And since the MERGE is driven by an outer join - *not* an INSERT-like search for duplicates - we'll find ourselves providing one behavior when detecting one case, and another when detecting the other, with subtle and confusing differences between each case. If we "fake" an outer join by doing an INSERT-like search for duplicates in a unique index (to make sure we see duplicates), then we get into trouble elsewhere. Like, we cannot let users not ask for an INSERT in their MERGE, because we'd be conclusively deciding to *not* (say) UPDATE on the basis of the absence of a value in a unique index, as opposed to the presence. Suddenly, we're in the business of proving a negative....or are we?. My head hurts. -- Peter Geoghegan
On Fri, Oct 3, 2014 at 3:42 PM, Peter Geoghegan <pg@heroku.com> wrote: > Yes, it is. I am opposed to using the MERGE syntax for this *because* > MERGE is useful (independently useful, when done properly), not > because it is not useful. As I've mentioned, there is also the practical argument: No one else does this. That suggests to me that it's hard. -- Peter Geoghegan
On Fri, Oct 3, 2014 at 3:42 PM, Peter Geoghegan <pg@heroku.com> wrote: >> We routinely pick very >> different plans based on the presence or absence of an index, and >> we use special snapshots in the course of executing many DML >> statements (if FK triggers are fired) Apart from FK snapshots, we also use dirty snapshots for unique index enforcement, obviously. That doesn't mean that it's the "command/xact snapshot", though. It's just a special constraint enforcement mechanism. -- Peter Geoghegan
On Fri, Oct 3, 2014 at 4:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >> ... the SQL standard does not require that MERGE be atomic in the >> sense of atomically providing either an INSERT or UPDATE, ... > > My understanding is that the standard logically requires (without > concern for implementation details) that the second specified table > (via table name or subquery -- which could be a VALUES statement) > is scanned, and for each row it attempts to match a row in the > target table. That will either match or not, according to the > boolean expression in the ON clause. You can have one WHEN MATCHED > THEN UPDATE clause and/or one WHEN NOT MATCHED THEN INSERT clause. > If both clauses are present, I believe that it is guaranteed that > one or the other (but not both) will fire for each row in the > second table. The usual guarantees for each isolation level must > not be violated (although an implementation is not required to > generate anomalies which could not happen with serial execution). > So as usual for a transaction involving multiple tables, > serialization anomalies are possible if the transaction isolation > level is REPEATABLE READ or less. Is that what you're getting at, > or something else? I think the problem is that it's not possible to respect the "usual guarantees" even at READ COMMITTED level when performing an INSERT OR UPDATE operation (however spelled). You may find that there's a tuple with the same PK which is committed but not visible to the snapshot you took at the beginning of the statement. An INSERT would fail; an UPDATE would see no rows and do nothing. IOW, *neither* operation succeeds according to its classic semantics; to succeed, the INSERT OR UPDATE has to do something altogether novel. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Oct 6, 2014 at 8:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think the problem is that it's not possible to respect the "usual > guarantees" even at READ COMMITTED level when performing an INSERT OR > UPDATE operation (however spelled). That's definitely the main problem. Also, there could be garden variety race conditions. -- Peter Geoghegan
On Mon, Oct 6, 2014 at 8:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think the problem is that it's not possible to respect the "usual > guarantees" even at READ COMMITTED level when performing an INSERT OR > UPDATE operation (however spelled). You may find that there's a tuple > with the same PK which is committed but not visible to the snapshot > you took at the beginning of the statement. Can you please comment on this, Kevin? It would be nice to converge on an agreement on syntax here (without necessarily working out the exact details right away, including for example the exact spelling of what I'm calling WITHIN). Since Simon has changed his mind on MERGE (while still wanting to retain a superficial flavor of MERGE, a flavor that does not include the MERGE keyword), I think that leaves you as the only advocate for the MERGE syntax. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Oct 6, 2014 at 8:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think the problem is that it's not possible to respect the "usual >> guarantees" even at READ COMMITTED level when performing an INSERT OR >> UPDATE operation (however spelled). You may find that there's a tuple >> with the same PK which is committed but not visible to the snapshot >> you took at the beginning of the statement. > > Can you please comment on this, Kevin? It would be nice to converge on > an agreement on syntax here Robert said "however spelled" -- which I take to mean that he at least sees that the MERGE-like UPSERT syntax can be turned into the desired semantics. I have no idea why anyone would think otherwise. Although the last go-around does suggest that there is at least one point of difference on the semantics. You seem to want to fire the BEFORE INSERT triggers before determining whether this will be an INSERT or an UPDATE. That seems like a bad idea to me, but if the consensus is that we want to do that, it does argue for your plan of making UPSERT a variant of the INSERT command. That doesn't seem as clean to me as saying (for example): UPSERT targettable t USING (VALUES (123, 100)) x(id, quantity) ON t.id = x.id WHEN NOT MATCHED INSERT (id, quantity, inserted_at)VALUES (x.id, x.quantity, now()) WHEN MATCHED UPDATE SET quantity = t.quantity + x.quantity, updated_at = now(); As I understand it you are proposing that would be: INSERT INTO targettable(key, quantity, inserted_at) VALUES(123, quantity, now()) ON CONFLICT WITHIN targettable_pkey UPDATESET quantity = quantity + CONFLICTING(quantity), updated_at = now(); -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 8, 2014 at 2:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > Although the last go-around does suggest that there is at least one > point of difference on the semantics. You seem to want to fire the > BEFORE INSERT triggers before determining whether this will be an > INSERT or an UPDATE. That seems like a bad idea to me, but if the > consensus is that we want to do that, it does argue for your plan > of making UPSERT a variant of the INSERT command. Well, it isn't that I'm doing it because I think that it is a great idea, with everything to recommend it. It's more like I don't see any practical alternative. We need the before row insert triggers to fire to figure out what to insert (or if we should update instead). No way around that. At the same time, those triggers are at liberty to do almost anything, and so in general we have no way of totally nullifying their effects (or side effects). Surely you see the dilemma. > As I understand it you are proposing that would be: > > INSERT INTO targettable(key, quantity, inserted_at) > VALUES(123, quantity, now()) > ON CONFLICT WITHIN targettable_pkey > UPDATE SET quantity = quantity + CONFLICTING(quantity), updated_at = now(); That's right, yes. -- Peter Geoghegan
On Thu, Oct 9, 2014 at 1:51 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Oct 8, 2014 at 2:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >> Although the last go-around does suggest that there is at least one >> point of difference on the semantics. You seem to want to fire the >> BEFORE INSERT triggers before determining whether this will be an >> INSERT or an UPDATE. That seems like a bad idea to me Indeed, the current behavior breaks even the canonical "keep track of how many posts are in a thread" trigger example because INSERT triggers are fired for an effective row UPDATE. I can't see how that's acceptable. > Well, it isn't that I'm doing it because I think that it is a great > idea, with everything to recommend it. It's more like I don't see any > practical alternative. I proposed an alternative that avoids this surprise and might allow some other benefits. Can you please look into that? Even a "clarify this", "this couldn't possibly work" or "you're not a major contributor so your arguments are invalid" response. ;) I admit I initially misunderstood some details about the current behavior. I can dig into the code and write a clearer and/or more detailed spec if I had even *some* feedback. Regards, Marti
On Wed, Oct 8, 2014 at 4:30 PM, Marti Raudsepp <marti@juffo.org> wrote: > On Thu, Oct 9, 2014 at 1:51 AM, Peter Geoghegan <pg@heroku.com> wrote: >> On Wed, Oct 8, 2014 at 2:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >>> Although the last go-around does suggest that there is at least one >>> point of difference on the semantics. You seem to want to fire the >>> BEFORE INSERT triggers before determining whether this will be an >>> INSERT or an UPDATE. That seems like a bad idea to me > > Indeed, the current behavior breaks even the canonical "keep track of > how many posts are in a thread" trigger example because INSERT > triggers are fired for an effective row UPDATE. I can't see how that's > acceptable. DB2 had the foresight to add the following restrictions to BEFORE ROW triggers - they cannot: """ Contain any INSERT, DELETE, or UPDATE operations, nor invoke any routine defined with MODIFIES SQL DATA, if it is not a compound SQL. Contain any DELETE or UPDATE operations on the trigger subject table, nor invoke any routine containing such operations, if it is a compound SQL. Reference a materialized query table defined with REFRESH IMMEDIATE Reference a generated column other than the identity column in the NEW transition variable. """ To get a sense of how doing fancy things in before triggers leads to trouble, considering the hardening Kevin added in commit 6868ed74. In short, use an AFTER trigger for this kind of thing, and all of these issues go away. >> Well, it isn't that I'm doing it because I think that it is a great >> idea, with everything to recommend it. It's more like I don't see any >> practical alternative. > > I proposed an alternative that avoids this surprise and might allow > some other benefits. Can you please look into that? I looked at that. You're talking about making the case where before row triggers clearly are appropriate (when you just want to change the tuple being inserted) fail, in order to make an arguably inappropriate case for before row triggers work (when you don't change the tuple to be inserted, but you do want to do some other thing). That seems backwards. My proposed behavior seems far less surprising. -- Peter Geoghegan
On Thu, Oct 9, 2014 at 2:46 AM, Peter Geoghegan <pg@heroku.com> wrote: >> Indeed, the current behavior breaks even the canonical "keep track of >> how many posts are in a thread" trigger example > use an AFTER trigger for this kind of thing, and all of these > issues go away. In the latest patches from CommitFest, the UPDATE case invokes BEFORE&AFTER INSERT triggers, not AFTER UPDATE. Is that going to be changed? If so, I concede that. >> I proposed an alternative that avoids this surprise and might allow >> some other benefits. Can you please look into that? > > I looked at that. Thanks! > You're talking about making the case where before > row triggers clearly are appropriate (when you just want to change the > tuple being inserted) fail Only in case the trigger changes *key* columns necessary for atomicity (i.e. from the WITHIN index). Other columns are fair game. The restriction seems justifiable to me: it's unreasonable to be atomic with respect to values that change mid-way. * It can distinguish between BEFORE INSERT and BEFORE UPDATE triggers, which your approach fundamentally cannot. * It can probably avoid evaluating column defaults in the UPDATE case, like nextval() for unrelated serial columns => reduced overhead * The semantics match better with MERGE-like syntax that seems to come up again and again. * And it appears to solve (part?) of the problem you had with naming key *colums* in the WITHIN clause, rather than using index name. If you don't see any reasons why it can't be done, these benefits seem clear to me. I think the tradeoffs at least warrant wider discussion. Regards, Marti
On Wed, Oct 8, 2014 at 5:37 PM, Marti Raudsepp <marti@juffo.org> wrote: > Only in case the trigger changes *key* columns necessary for atomicity > (i.e. from the WITHIN index). Other columns are fair game. The > restriction seems justifiable to me: it's unreasonable to be atomic > with respect to values that change mid-way. > If you don't see any reasons why it can't be done, these benefits seem > clear to me. I think the tradeoffs at least warrant wider discussion. I don't. That's very surprising. One day, it will fail unexpectedly. As proposed, the way BEFORE INSERT triggers fire almost forces users to consider the issues up-front. Note that the CONFLICTING() behavior with respect to BEFORE INSERT triggers work's the same as MySQL's "INSERT ON DUPLICATE KEY UPDATE foo = VALUES(foo)" thing. There was agreement that that was the right behavior, it seemed. -- Peter Geoghegan
On Thu, Oct 9, 2014 at 3:47 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Oct 8, 2014 at 5:37 PM, Marti Raudsepp <marti@juffo.org> wrote: >> Only in case the trigger changes *key* columns necessary for atomicity >> (i.e. from the WITHIN index). Other columns are fair game. The >> restriction seems justifiable to me: it's unreasonable to be atomic >> with respect to values that change mid-way. Oh, one more consideration: I believe you will run into the same issue if you want to implement BEFORE UPDATE triggers in any form. Skipping BEFORE UPDATE entirely seems to violate POLA. It's common for applications to e.g. use triggers to keep track of latest modified time for a row. With your proposal, every query needs to include logic for that to work. >> If you don't see any reasons why it can't be done, these benefits seem >> clear to me. I think the tradeoffs at least warrant wider discussion. > > I don't. That's very surprising. One day, it will fail unexpectedly. > As proposed, the way BEFORE INSERT triggers fire almost forces users > to consider the issues up-front. Not necessarily "up-front", as proposed it causes existing triggers to change behavior when users adopt the upsert feature. And that adoption may even be transparent to the user due to ORM magic. There are potential surprises with both approaches. Personally I prefer a hard error instead of silently skipping logic that previously worked. And that seems to match general PostgreSQL philosophy too. > Note that the CONFLICTING() behavior with respect to BEFORE INSERT > triggers work's the same as MySQL's "INSERT ON DUPLICATE KEY UPDATE > foo = VALUES(foo)" thing. There was agreement that that was the right > behavior, it seemed. MySQL gets away with lots of things, they have several other caveats with triggers. I don't think it's a good example to follow wrt trigger behavior. Regards, Marti
On Wed, Oct 8, 2014 at 6:12 PM, Marti Raudsepp <marti@juffo.org> wrote: > Oh, one more consideration: I believe you will run into the same issue > if you want to implement BEFORE UPDATE triggers in any form. Skipping > BEFORE UPDATE entirely seems to violate POLA. Good thing that the patch doesn't do that, then. I clearly documented this in a few places, including: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/trigger-definition.html > It's common for applications to e.g. use triggers to keep track of > latest modified time for a row. With your proposal, every query needs > to include logic for that to work. Wrong. >>> If you don't see any reasons why it can't be done, these benefits seem >>> clear to me. I think the tradeoffs at least warrant wider discussion. >> >> I don't. That's very surprising. One day, it will fail unexpectedly. >> As proposed, the way BEFORE INSERT triggers fire almost forces users >> to consider the issues up-front. > > Not necessarily "up-front", as proposed it causes existing triggers to > change behavior when users adopt the upsert feature. And that adoption > may even be transparent to the user due to ORM magic. > > There are potential surprises with both approaches. When you make the slightest effort to understand what my approach is, I might take your remarks seriously. >> Note that the CONFLICTING() behavior with respect to BEFORE INSERT >> triggers work's the same as MySQL's "INSERT ON DUPLICATE KEY UPDATE >> foo = VALUES(foo)" thing. There was agreement that that was the right >> behavior, it seemed. > > MySQL gets away with lots of things, they have several other caveats > with triggers. I don't think it's a good example to follow wrt trigger > behavior. No true Scotsman. -- Peter Geoghegan
On Thu, Oct 9, 2014 at 4:25 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Oct 8, 2014 at 6:12 PM, Marti Raudsepp <marti@juffo.org> wrote: >> Skipping >> BEFORE UPDATE entirely seems to violate POLA. > Good thing that the patch doesn't do that, then. I clearly documented > this in a few places, including: > http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/trigger-definition.html I tried to understand the docs and found them somewhat confusing, so I turned to testing the implementation you posted on 2014-10-07. It shows no signs of running UPDATE triggers in the following case. So is this just a bug or a missing feature? create table evt_type (id serial primary key, name text unique, evt_count int, update_count int default 0); prepare upsert(text) as INSERT into evt_type (name, evt_count) values ($1, 1) on conflict within evt_type_name_key UPDATE set evt_count=conflicting(evt_count)+1; create or replace function ins() returns trigger language plpgsql as $$begin raise notice 'trigger % %', TG_WHEN, TG_OP; return new; end$$; create or replace function upd() returns trigger language plpgsql as $$begin raise notice 'trigger % %', TG_WHEN, TG_OP; new.update_count=new.update_count+1; return new; end$$; create trigger ev1 before insert on evt_type execute procedure ins(); create trigger ev2 before update on evt_type execute procedure upd(); create trigger ev3 after insert on evt_type execute procedure ins(); create trigger ev4 after update on evt_type execute procedure upd(); db=# execute upsert('foo'); NOTICE: trigger BEFORE INSERT NOTICE: trigger AFTER INSERT INSERT 0 1 db=# execute upsert('foo'); NOTICE: trigger BEFORE INSERT NOTICE: trigger AFTER INSERT INSERT 0 0 marti=# table evt_type;id | name | evt_count | update_count ----+------+-----------+-------------- 1 | foo | 2 | 0 >> MySQL gets away with lots of things, they have several other caveats > No true Scotsman. Eh? I'm just saying there may be good reasons not to imitate MySQL here. Regards, Marti
On Thu, Oct 9, 2014 at 4:56 AM, Marti Raudsepp <marti@juffo.org> wrote: > create trigger ev1 before insert on evt_type execute procedure ins(); > create trigger ev2 before update on evt_type execute procedure upd(); > create trigger ev3 after insert on evt_type execute procedure ins(); > create trigger ev4 after update on evt_type execute procedure upd(); Oooooops, I missed "for each row" here. Sorry for wasting your time. Regards, Marti
On Wed, Oct 8, 2014 at 7:04 PM, Marti Raudsepp <marti@juffo.org> wrote: > Oooooops, I missed "for each row" here. Note that I have an open item for what to do about statement level triggers here: https://wiki.postgresql.org/wiki/UPSERT I'm not satisfied with the current behavior. It needs more thought. But I think row level triggers are fine. -- Peter Geoghegan
On 10/8/14, 5:51 PM, Peter Geoghegan wrote: > On Wed, Oct 8, 2014 at 2:01 PM, Kevin Grittner<kgrittn@ymail.com> wrote: >> >Although the last go-around does suggest that there is at least one >> >point of difference on the semantics. You seem to want to fire the >> >BEFORE INSERT triggers before determining whether this will be an >> >INSERT or an UPDATE. That seems like a bad idea to me, but if the >> >consensus is that we want to do that, it does argue for your plan >> >of making UPSERT a variant of the INSERT command. > Well, it isn't that I'm doing it because I think that it is a great > idea, with everything to recommend it. It's more like I don't see any > practical alternative. We need the before row insert triggers to fire > to figure out what to insert (or if we should update instead). No way > around that. At the same time, those triggers are at liberty to do > almost anything, and so in general we have no way of totally > nullifying their effects (or side effects). Surely you see the > dilemma. FWIW, if each row was handled in a subtransaction then an insert that turned out to be an update could be rolled back...but the performance impact of going that route might be pretty horrid. :( There's also the potential to get stuckin a loop where a BEFORE INSERT trigger turns the tuple into an UPDATE and a BEFORE UPDATE trigger turns it into anINSERT. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 10/10/14 12:38, Jim Nasby wrote: > On 10/8/14, 5:51 PM, Peter Geoghegan wrote: >> On Wed, Oct 8, 2014 at 2:01 PM, Kevin Grittner<kgrittn@ymail.com> >> wrote: >>> >Although the last go-around does suggest that there is at least one >>> >point of difference on the semantics. You seem to want to fire the >>> >BEFORE INSERT triggers before determining whether this will be an >>> >INSERT or an UPDATE. That seems like a bad idea to me, but if the >>> >consensus is that we want to do that, it does argue for your plan >>> >of making UPSERT a variant of the INSERT command. >> Well, it isn't that I'm doing it because I think that it is a great >> idea, with everything to recommend it. It's more like I don't see any >> practical alternative. We need the before row insert triggers to fire >> to figure out what to insert (or if we should update instead). No way >> around that. At the same time, those triggers are at liberty to do >> almost anything, and so in general we have no way of totally >> nullifying their effects (or side effects). Surely you see the >> dilemma. > > FWIW, if each row was handled in a subtransaction then an insert that > turned out to be an update could be rolled back... but the performance > impact of going that route might be pretty horrid. :( There's also the > potential to get stuck in a loop where a BEFORE INSERT trigger turns > the tuple into an UPDATE and a BEFORE UPDATE trigger turns it into an > INSERT. Perhaps you need an UPSERT trigger?
On Wed, Oct 8, 2014 at 5:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > Peter Geoghegan <pg@heroku.com> wrote: >> On Mon, Oct 6, 2014 at 8:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I think the problem is that it's not possible to respect the "usual >>> guarantees" even at READ COMMITTED level when performing an INSERT OR >>> UPDATE operation (however spelled). You may find that there's a tuple >>> with the same PK which is committed but not visible to the snapshot >>> you took at the beginning of the statement. >> >> Can you please comment on this, Kevin? It would be nice to converge on >> an agreement on syntax here > > Robert said "however spelled" -- which I take to mean that he at > least sees that the MERGE-like UPSERT syntax can be turned into the > desired semantics. I have no idea why anyone would think otherwise. > > Although the last go-around does suggest that there is at least one > point of difference on the semantics. You seem to want to fire the > BEFORE INSERT triggers before determining whether this will be an > INSERT or an UPDATE. OK, I have a comment on this. Anything we do about triggers will by definition be novel. Right now, we have INSERT, UPDATE, and DELETE. If we add a new operation, whether it's called UPSERT or MERGE or FROB, or if we add a flag to insert that makes it possibly do something other than inserting (e.g. INSERT OR UPDATE), or if we add a flag to update that makes it do something other than updating (e.g. UPDATE or INSERT), then some people's triggers are going to get broken by that change no matter how we do it. When the new operation is invoked, we can fire the insert triggers, the update triggers, some new kind of trigger, or no trigger at all - and any decision we make there will not in all cases be backward-compatible. We can try to minimize the damage (and we probably should) but we can't make it go to zero. > INSERT INTO targettable(key, quantity, inserted_at) > VALUES(123, quantity, now()) > ON CONFLICT WITHIN targettable_pkey > UPDATE SET quantity = quantity + CONFLICTING(quantity), updated_at = now(); I actually like this syntax reasonably well in some ways, but I don't like that we're mentioning the index name, and the CONFLICTING() notation is decidedly odd. Maybe something like this: INSERT INTO targettable(key, quantity, inserted_at) VALUES(123, quantity, now()) ON DUPLICATE (key) UPDATE SET quantity= quantity + OLD.quantity, updated_at = now(); (Perhaps OLD should reference the tuple already in the table, and NEW the value from the VALUES clause. That would be snazzy indeed.) Also, how about making the SET clause optional, with the semantics that we just update all of the fields for which a value is explicitly specified: INSERT INTO overwrite_with_abandon (key, value) VALUES (42, 'meaning of life') ON DUPLICATE (key) UPDATE; While the ability to specify a SET clause there explicitly is useful, I bet a lot of key-value store users would love the heck out of a syntax that let them omit it when they want to overwrite. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 10, 2014 at 11:05 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Anything we do about triggers will by definition be novel. Right now, > we have INSERT, UPDATE, and DELETE. If we add a new operation, > whether it's called UPSERT or MERGE or FROB, or if we add a flag to > insert that makes it possibly do something other than inserting (e.g. > INSERT OR UPDATE), or if we add a flag to update that makes it do > something other than updating (e.g. UPDATE or INSERT), then some > people's triggers are going to get broken by that change no matter how > we do it. When the new operation is invoked, we can fire the insert > triggers, the update triggers, some new kind of trigger, or no trigger > at all - and any decision we make there will not in all cases be > backward-compatible. We can try to minimize the damage (and we > probably should) but we can't make it go to zero. +1. It's inevitable that someone isn't going to be entirely happy with whatever we do. Let's be realistic about that, while minimizing the risk. > I actually like this syntax reasonably well in some ways, but I don't > like that we're mentioning the index name, and the CONFLICTING() > notation is decidedly odd. Maybe something like this: People keep remarking that they don't like that you can (optionally) name a unique index explicitly, and I keep telling them why I've done it that way [1]. There is a trade-off here. I am willing to go another way in that trade-off, but let's have a realistic conversation about it. > Also, how about making the SET clause optional, with the semantics > that we just update all of the fields for which a value is explicitly > specified: > > INSERT INTO overwrite_with_abandon (key, value) > VALUES (42, 'meaning of life') > ON DUPLICATE (key) UPDATE; > > While the ability to specify a SET clause there explicitly is useful, > I bet a lot of key-value store users would love the heck out of a > syntax that let them omit it when they want to overwrite. While I initially like that idea, now I'm not so sure about it [2]. MySQL don't allow this (with ON DUPLICATE KEY UPDATE). Their REPLACE command allows it, since it is defined as a DELETE followed by an INSERT (or something like that), but I think that in general that feature has been a disaster for them. [1] http://www.postgresql.org/message-id/CAM3SWZQpGf6_ME6D1vWqdCFadD7Nprdx2JrE=Hcf=93KXeHY4g@mail.gmail.com [2] http://www.postgresql.org/message-id/CAM3SWZT=VXBJ7QKAidAmYbU40aP10udSqOOqhViX3Ykj7WBv9A@mail.gmail.com -- Peter Geoghegan
Robert Haas <robertmhaas@gmail.com> wrote: > Anything we do about triggers will by definition be novel. Right > now, we have INSERT, UPDATE, and DELETE. If we add a new > operation, whether it's called UPSERT or MERGE or FROB, [ ... ] If we call it MERGE, then we had better follow the rules the standard lays out for triggers on a MERGE statement -- which is that UPDATE triggers fire for rows updated and that INSERT triggers fire for rows inserted. That kinda makes sense, since the MERGE command is almost like a row-by-row CASE statement allowing one or the other. If we make up our own command verb, we are free to do as we like. > Maybe something like this: > > INSERT INTO targettable(key, quantity, inserted_at) > VALUES(123, quantity, now()) > ON DUPLICATE (key) > UPDATE SET quantity = quantity + OLD.quantity, updated_at = now(); That seems a lot cleaner than the proposal on the Wiki page. If we go that route, it makes sense to fire the BEFORE INSERT triggers before attempting the insert and then fire BEFORE UPDATE triggers before attempting the UPDATE. If either succeeds, I think we should fire the corresponding AFTER triggers. We already allow a BEFORE triggers to run and then omit the triggering operation without an error, so I don't see that as a problem. This makes a lot more sense to me than attempting to add a new UPSERT trigger type. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 10, 2014 at 11:30 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > That seems a lot cleaner than the proposal on the Wiki page. If we > go that route, it makes sense to fire the BEFORE INSERT triggers > before attempting the insert and then fire BEFORE UPDATE triggers > before attempting the UPDATE. If either succeeds, I think we > should fire the corresponding AFTER triggers. We already allow a > BEFORE triggers to run and then omit the triggering operation > without an error, so I don't see that as a problem. This makes a > lot more sense to me than attempting to add a new UPSERT trigger > type. You realize that that's exactly what my patch does, right? AFAICT the only confusion that my patch has is with statement-level triggers, as discussed on the Wiki page. I think that the row-level trigger behavior is fine; a lot more thought has gone into it. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Oct 10, 2014 at 11:05 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I actually like this syntax reasonably well in some ways, but I don't >> like that we're mentioning the index name, and the CONFLICTING() >> notation is decidedly odd. > > People keep remarking that they don't like that you can (optionally) > name a unique index explicitly, and I keep telling them why I've done > it that way [1]. There is a trade-off here. I am willing to go another > way in that trade-off, but let's have a realistic conversation about > it. We've all read that, and your repeated arguments for that point of view. We disagree and have said why. What in that is not a realistic conversation? To restate: to do so is conflating the logical definition of the database with a particular implementation detail. As just one reason that is a bad idea: we can look up unique indexes on the specified columns, but if we implement a other storage techniques where there is no such thing as a unique index on the columns, yet manage to duplicate the semantics (yes, stranger things have happened), people can't migrate to the new structure without rewriting their queries. If the syntax references logical details (like column names) there is no need to rewrite. We don't want to be painted into a corner. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 10, 2014 at 11:38 AM, Peter Geoghegan <pg@heroku.com> wrote: >> That seems a lot cleaner than the proposal on the Wiki page. If we >> go that route, it makes sense to fire the BEFORE INSERT triggers >> before attempting the insert and then fire BEFORE UPDATE triggers >> before attempting the UPDATE. By the way, there is no problem with failing to UPDATE, because we lock rows ahead of the UPDATE. Once a row is locked, the UPDATE cannot conflict. There is no danger of UPDATE before row-level triggers firing without then updating (unless the xact aborts, but you know what I mean). In general, there is no danger of triggers firing more often than you might consider that they should, with the sole exception of the fact that we always fire before insert row-level triggers, even when an UPDATE is the ultimate outcome. Restarting for conflicts (e.g. handling concurrent insertions with promise tuples) necessitates a restart, but to the point after before row-level insert triggers fire, and from a point before any other triggers have the opportunity to fire. -- Peter Geoghegan
On Fri, Oct 10, 2014 at 11:44 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > We've all read that, and your repeated arguments for that point of > view. We disagree and have said why. What in that is not a > realistic conversation? Because, as I've said many times, there are problems with naming columns directly that need to be addressed. There are corner-cases. Are you going to complain about those once I go implement something? Let's talk about that. I think we can just throw an error in these edge-cases, but let's decide if we're comfortable with that. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Oct 10, 2014 at 11:30 AM, Kevin Grittner <kgrittn@ymail.com> wrote: >> That seems a lot cleaner than the proposal on the Wiki page. If we >> go that route, it makes sense to fire the BEFORE INSERT triggers >> before attempting the insert and then fire BEFORE UPDATE triggers >> before attempting the UPDATE. If either succeeds, I think we >> should fire the corresponding AFTER triggers. We already allow a >> BEFORE triggers to run and then omit the triggering operation >> without an error, so I don't see that as a problem. This makes a >> lot more sense to me than attempting to add a new UPSERT trigger >> type. > > You realize that that's exactly what my patch does, right? I haven't read the patch, but the discussion has made that clear, yes. Try to take agreement on a point gracefully, will ya? ;-) > AFAICT the only confusion that my patch has is with statement-level > triggers, as discussed on the Wiki page. I think that the row-level > trigger behavior is fine; a lot more thought has gone into it. IMV it is clear that since the standard says that update or insert operations which affect zero rows fire the corresponding trigger, the statement level triggers for both should fire for an UPSERT, even if the set of affected rows for one or the other (or both!) is zero. If we ever get transition relations, it will be easy to check the count of affected rows or otherwise behave appropriately based on what was done. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Peter Geoghegan <pg@heroku.com> wrote: > There is no danger of UPDATE before row-level triggers firing > without then updating (unless the xact aborts, but you know what > I mean). Well, let's make sure I do know what you mean. If a BEFORE UPDATE ... FOR EACH ROW trigger returns NULL, the UPDATE will be skipped without error, right? The BEFORE UPDATE triggers will run before the UPDATE if a duplicate is found, and the return value will be treated in the usual manner, replacing values and potentially skipping the update? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 10, 2014 at 2:29 PM, Peter Geoghegan <pg@heroku.com> wrote: > People keep remarking that they don't like that you can (optionally) > name a unique index explicitly, and I keep telling them why I've done > it that way [1]. There is a trade-off here. I am willing to go another > way in that trade-off, but let's have a realistic conversation about > it. I think what's realistic here is that the patch isn't going to get committed the way you have it today, because nobody else likes that design. That may be harsh, but I think it's accurate. But on the substance of the issue, I'm totally unconvinced by your argument in the linked email. Let's suppose you have this: create table foo (a int, b int); create unique index foo1 on foo (a); create unique index foo2 on foo (a); create unique index foo3 on foo (a) where b = 1; Anything that conflicts in foo1 or foo2 is going to conflict in the other one, and with foo3. Anything that conflicts in foo3 is perhaps also going to conflict in foo1 and foo2, or perhaps not. Yet, if the statement simply says ON DUPLICATE (a) UPDATE, it's entirely clear what to do: when we hit a conflict in any of those three indexes, update the row. No matter which index has the conflict, updating is the right answer, and solves the conflict in all three indexes. I dunno what you'd expect someone to write in your syntax to solve this problem - presumably either (1) they can list any of the indexes that might conflict, (2) they must list all of the indexes that might conflict, or (3) it just doesn't work. Whatever the actual behavior of your patch is today, it seems to me that the conflict is, fundamentally, over a set of column values, NOT over a particular index. The index is simply a mechanism for noticing that conflict. If you want to allow this to work with expression indexes, that's not really a problem; you can let the user write INSERT .. ON CONFLICT (upper(foo)) UPDATE ... and match that to the index. It wouldn't bother me if the first version of this feature couldn't target expression indexes anyway, but if you want to include that, having the user mention the actual expression rather than the index name shouldn't make much difference. >> Also, how about making the SET clause optional, with the semantics >> that we just update all of the fields for which a value is explicitly >> specified: >> >> INSERT INTO overwrite_with_abandon (key, value) >> VALUES (42, 'meaning of life') >> ON DUPLICATE (key) UPDATE; >> >> While the ability to specify a SET clause there explicitly is useful, >> I bet a lot of key-value store users would love the heck out of a >> syntax that let them omit it when they want to overwrite. > > While I initially like that idea, now I'm not so sure about it [2]. > MySQL don't allow this (with ON DUPLICATE KEY UPDATE). Their REPLACE > command allows it, since it is defined as a DELETE followed by an > INSERT (or something like that), but I think that in general that > feature has been a disaster for them. Your syntax allows the exact same thing; it simply require the user to be more verbose in order to get that behavior. If we think that people wanting that behavior will be rare, then it's fine to require them to spell it out when they want it. If we think it will be the overwhelming common application of this feature, and I do, then making people spell it out when we could just as well infer it is pointless. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Oct 10, 2014 at 11:44 AM, Kevin Grittner <kgrittn@ymail.com> wrote: >> [discussion on committers rejecting the notion of a syntax >> involving specification of an index name] > as I've said many times, there are problems with naming > columns directly that need to be addressed. There are corner-cases. > Are you going to complain about those once I go implement something? If I don't like what I see, yes. I think it will be entirely possible for you to write something along those lines that I won't complain about. > I think we can just throw an error in these edge-cases, but let's > decide if we're comfortable with that. If the list of columns does not allow a choice of a suitable full-table unique index on only non-null columns, an error seems appropriate. What other problem cases do you see? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 10, 2014 at 11:55 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > I haven't read the patch, but the discussion has made that clear, > yes. Try to take agreement on a point gracefully, will ya? ;-) Heh, sorry. I did literally mean what I said - it wasn't 100% clear to me that you got that. It's safest to not make any assumptions like that. >> AFAICT the only confusion that my patch has is with statement-level >> triggers, as discussed on the Wiki page. I think that the row-level >> trigger behavior is fine; a lot more thought has gone into it. > > IMV it is clear that since the standard says that update or insert > operations which affect zero rows fire the corresponding trigger, > the statement level triggers for both should fire for an UPSERT, > even if the set of affected rows for one or the other (or both!) is > zero. If we ever get transition relations, it will be easy to > check the count of affected rows or otherwise behave appropriately > based on what was done. I think that you're probably right about that. Note that UPDATE rules will not be applied to the "UPDATE portion" of the statement. Not sure what to do about that, but I'm pretty sure that it's inevitable, because the UPDATE is effectively the same top-level statement for the purposes of rules. -- Peter Geoghegan
On Fri, Oct 10, 2014 at 12:04 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > Peter Geoghegan <pg@heroku.com> wrote: > >> There is no danger of UPDATE before row-level triggers firing >> without then updating (unless the xact aborts, but you know what >> I mean). > > Well, let's make sure I do know what you mean. If a BEFORE UPDATE > ... FOR EACH ROW trigger returns NULL, the UPDATE will be skipped > without error, right? The BEFORE UPDATE triggers will run before > the UPDATE if a duplicate is found, and the return value will be > treated in the usual manner, replacing values and potentially > skipping the update? That's exactly right, yes (in particular, you can return NULL from before row update triggers and have that cancel the update in the usual manner). And potentially, the final value could be affected by both the before row insert and before row update triggers (by way of CONFLICTING()). That's the most surprising aspect of what I've done (although I would argue it's the least surprising behavior possible given my constraints). -- Peter Geoghegan
On Fri, Oct 10, 2014 at 12:09 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I think what's realistic here is that the patch isn't going to get > committed the way you have it today, because nobody else likes that > design. That may be harsh, but I think it's accurate. I do think that's harsh, because it's unnecessary: I am looking for the best design. If you want to propose alternatives, great, but there is a reason why I've done things that way, and that should be acknowledged. I too think naming the unique index is ugly as sin, and have said as much multiple times. We're almost on the same page here. > But on the substance of the issue, I'm totally unconvinced by your > argument in the linked email. Let's suppose you have this: > > create table foo (a int, b int); > create unique index foo1 on foo (a); > create unique index foo2 on foo (a); > create unique index foo3 on foo (a) where b = 1; > > Anything that conflicts in foo1 or foo2 is going to conflict in the > other one, and with foo3. Anything that conflicts in foo3 is perhaps > also going to conflict in foo1 and foo2, or perhaps not. Yet, if the > statement simply says ON DUPLICATE (a) UPDATE, it's entirely clear > what to do: when we hit a conflict in any of those three indexes, > update the row. No matter which index has the conflict, updating is > the right answer, and solves the conflict in all three indexes. I > dunno what you'd expect someone to write in your syntax to solve this > problem - presumably either (1) they can list any of the indexes that > might conflict, (2) they must list all of the indexes that might > conflict, or (3) it just doesn't work. I expect them to name exactly one unique index. They should either do that explicitly, or have one in mind and make the behavior implicit at the risk of miscalculating (and having a surprising outcome). It doesn't matter if it's foo1 or foo2 in this example (but foo3 is different, obviously). > Whatever the actual behavior > of your patch is today, it seems to me that the conflict is, > fundamentally, over a set of column values, NOT over a particular > index. The index is simply a mechanism for noticing that conflict. I think that this is the kernel of our disagreement. The index is not simply a mechanism for noticing that conflict. The user had better have one unique index in mind to provoke taking the alternative path in the event of a would-be unique violation. Clearly it doesn't matter much in this particular example. But what if there were two partial unique indexes, that were actually distinct, but only in terms of the constant appearing in their predicate? And what about having that changed by a before insert row-level trigger? Are we to defer deciding which unique index to use at the last possible minute? As always with this stuff, the devil is in the details. If we work out the details, then I can come up with something that's acceptable to everyone. Would you be okay with this never working with partial unique indexes? That gives me something to work with. > If you want to allow this to work with expression indexes, that's not > really a problem; you can let the user write INSERT .. ON CONFLICT > (upper(foo)) UPDATE ... and match that to the index. It wouldn't > bother me if the first version of this feature couldn't target > expression indexes anyway, but if you want to include that, having the > user mention the actual expression rather than the index name > shouldn't make much difference. I'm not that worried about expression indexes, actually. I'm mostly worried about partial unique indexes, particularly when before insert row-level triggers are in play (making *that* play nice with expression indexes is harder still, but expression indexes on their own are probably not that much of a problem). >>> Also, how about making the SET clause optional, with the semantics >>> that we just update all of the fields for which a value is explicitly >>> specified: >>> >>> INSERT INTO overwrite_with_abandon (key, value) >>> VALUES (42, 'meaning of life') >>> ON DUPLICATE (key) UPDATE; > Your syntax allows the exact same thing; it simply require the user to > be more verbose in order to get that behavior. If we think that > people wanting that behavior will be rare, then it's fine to require > them to spell it out when they want it. If we think it will be the > overwhelming common application of this feature, and I do, then making > people spell it out when we could just as well infer it is pointless. Did you consider my example? I think that people will like this idea, too - that clearly isn't the only consideration, though. As you say, it would be very easy to implement this. However, IMV, we shouldn't, because it is hazardous. MySQL doesn't allow this, and they tend to find expedients like this useful. -- Peter Geoghegan
On 10/9/14, 6:59 PM, Gavin Flower wrote: > On 10/10/14 12:38, Jim Nasby wrote: >> On 10/8/14, 5:51 PM, Peter Geoghegan wrote: >>> On Wed, Oct 8, 2014 at 2:01 PM, Kevin Grittner<kgrittn@ymail.com> wrote: >>>> >Although the last go-around does suggest that there is at least one >>>> >point of difference on the semantics. You seem to want to fire the >>>> >BEFORE INSERT triggers before determining whether this will be an >>>> >INSERT or an UPDATE. That seems like a bad idea to me, but if the >>>> >consensus is that we want to do that, it does argue for your plan >>>> >of making UPSERT a variant of the INSERT command. >>> Well, it isn't that I'm doing it because I think that it is a great >>> idea, with everything to recommend it. It's more like I don't see any >>> practical alternative. We need the before row insert triggers to fire >>> to figure out what to insert (or if we should update instead). No way >>> around that. At the same time, those triggers are at liberty to do >>> almost anything, and so in general we have no way of totally >>> nullifying their effects (or side effects). Surely you see the >>> dilemma. >> >> FWIW, if each row was handled in a subtransaction then an insert that turned out to be an update could be rolled back...but the performance impact of going that route might be pretty horrid. :( There's also the potential to get stuckin a loop where a BEFORE INSERT trigger turns the tuple into an UPDATE and a BEFORE UPDATE trigger turns it into anINSERT. > Perhaps you need an UPSERT trigger? I would think that a BEFORE UPSERT trigger would very likely want to know whether we were inserting or updating, which basicallyputs us back where we started. That said, since the use case for UPSERT is different than both INSERT and UPDATE maybe it would be a good idea to have aseparate trigger for them anyway. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Oct 10, 2014 at 12:09 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think what's realistic here is that the patch isn't going to get >> committed the way you have it today, because nobody else likes that >> design. That may be harsh, but I think it's accurate. > > I do think that's harsh, because it's unnecessary: I am looking for > the best design. If you want to propose alternatives, great, but there > is a reason why I've done things that way, and that should be > acknowledged. I too think naming the unique index is ugly as sin, and > have said as much multiple times. We're almost on the same page here. I hope you can adjust to the feedback, because it would be disappointing to have this feature slip from the release, which is what will happen if the index name remains part of the syntax. > Would you be okay with this never working with partial unique > indexes? That gives me something to work with. That seems like the only sensible course, to me. >> If you want to allow this to work with expression indexes, that's not >> really a problem; you can let the user write INSERT .. ON CONFLICT >> (upper(foo)) UPDATE ... and match that to the index. It wouldn't >> bother me if the first version of this feature couldn't target >> expression indexes anyway, but if you want to include that, having the >> user mention the actual expression rather than the index name >> shouldn't make much difference. > > I'm not that worried about expression indexes, actually. I'm mostly > worried about partial unique indexes, particularly when before insert > row-level triggers are in play (making *that* play nice with > expression indexes is harder still, but expression indexes on their > own are probably not that much of a problem). There is a problem if any column of the index allows nulls, since that would allow multiple rows in the target table to match an argument. Proving that an expression does not could be tricky. I suggest that, at least for a first cut, we restrict this to matching an index on NOT NULL columns. >>>> Also, how about making the SET clause optional, with the semantics >>>> that we just update all of the fields for which a value is explicitly >>>> specified: >>>> >>>> INSERT INTO overwrite_with_abandon (key, value) >>>> VALUES (42, 'meaning of life') >>>> ON DUPLICATE (key) UPDATE; > >> Your syntax allows the exact same thing; it simply require the user to >> be more verbose in order to get that behavior. If we think that >> people wanting that behavior will be rare, then it's fine to require >> them to spell it out when they want it. If we think it will be the >> overwhelming common application of this feature, and I do, then making >> people spell it out when we could just as well infer it is pointless. +1 > Did you consider my example? I think that people will like this idea, > too - that clearly isn't the only consideration, though. As you say, > it would be very easy to implement this. However, IMV, we shouldn't, > because it is hazardous. To quote the cited case: > 1. Developer writes the query, and it works fine. > > 2. Some time later, the DBA adds an inserted_at column (those are > common). The DBA is not aware of the existence of this particular > query. The new column has a default value of now(), say. > > Should we UPDATE the inserted_at column to be NULL? Or (more > plausibly) the default value filled in by the INSERT? Or leave it be? > I think there is a case to be made for all of these behaviors, and > that tension makes me prefer to not do this at all. It's like > encouraging "SELECT *" queries in production, only worse. That does point out the problem, in general, with carrying values from a BEFORE INSERT trigger into an UPDATE. Perhaps if the INSERT fails the UPDATE phase should start with the values specified in the first place, and not try to use anything returned by the BEFORE INSERT triggers? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 10, 2014 at 2:17 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > That said, since the use case for UPSERT is different than both INSERT and > UPDATE maybe it would be a good idea to have a separate trigger for them > anyway. -1. I think that having a separate UPSERT trigger is a bad idea. I'll need to change the statement-level behavior to match what Kevin described (fires twice, always, when INSERT ON CONFLICT UPDATE is used). -- Peter Geoghegan
On Fri, Oct 10, 2014 at 2:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >> Would you be okay with this never working with partial unique >> indexes? That gives me something to work with. > > That seems like the only sensible course, to me. Okay, great. If that's the consensus, then that's all I need to know. >> I'm not that worried about expression indexes, actually. I'm mostly >> worried about partial unique indexes, particularly when before insert >> row-level triggers are in play > There is a problem if any column of the index allows nulls, since > that would allow multiple rows in the target table to match an > argument. Proving that an expression does not could be tricky. I > suggest that, at least for a first cut, we restrict this to > matching an index on NOT NULL columns. Hmm. That could definitely be a problem. But the problem is clearly the user's fault. It could allow multiple rows to "match", as you say - but you're talking about a different concept of matching. Unlike with certain other systems, that isn't a would-be unique violation, and so it isn't a "match" in the sense I intend, and so you insert. You don't match more than one row, because (in the sense peculiar to this feature) there are no matches, and insertion really is the correct outcome. As a user, you have no more right to be surprised by this then by the fact that you wouldn't see a dup violation with a similar vanilla INSERT - some people are surprised by that, with regularity. Maybe I need to emphasize the distinction in the documentation between the two slightly different ideas of "matching" that are in tension here. I think it would be a mistake to forcibly normalize things to remove that distinction (by adding an artificial restriction), at any rate. >> Did you consider my example? I think that people will like this idea, >> too - that clearly isn't the only consideration, though. As you say, >> it would be very easy to implement this. However, IMV, we shouldn't, >> because it is hazardous. > > To quote the cited case: > >> 1. Developer writes the query, and it works fine. >> >> 2. Some time later, the DBA adds an inserted_at column (those are >> common). The DBA is not aware of the existence of this particular >> query. The new column has a default value of now(), say. >> >> Should we UPDATE the inserted_at column to be NULL? Or (more >> plausibly) the default value filled in by the INSERT? Or leave it be? >> I think there is a case to be made for all of these behaviors, and >> that tension makes me prefer to not do this at all. It's like >> encouraging "SELECT *" queries in production, only worse. > > That does point out the problem, in general, with carrying values > from a BEFORE INSERT trigger into an UPDATE. Perhaps if the INSERT > fails the UPDATE phase should start with the values specified in > the first place, and not try to use anything returned by the BEFORE > INSERT triggers? I think that that behavior is far more problematic for numerous other reasons, though. Potentially, you're not seeing what *really* caused the conflict at all. I just don't think it's worth it to save a bit of typing. -- Peter Geoghegan
On 2014-10-10 19:44, Kevin Grittner wrote: > Peter Geoghegan <pg@heroku.com> wrote: >> People keep remarking that they don't like that you can (optionally) >> name a unique index explicitly, [...] > To restate: to do so is conflating the logical definition of the > database with a particular implementation detail. As just one > reason that is a bad idea: we can look up unique indexes on the > specified columns, but if we implement a other storage techniques > where there is no such thing as a unique index on the columns, yet > manage to duplicate the semantics (yes, stranger things have > happened), people can't migrate to the new structure without > rewriting their queries Wouldn't it be good enough to define the 'WITHIN' as expecting a unique-constraint name rather than an index name (even though those happen to be the same strings)? I think constraints are part of the logical definition of the database, and a new storage technique which doesn't use indexes should still have names for its unique constraints. -M-
On 10/12/14, 2:36 PM, Matthew Woodcraft wrote: > On 2014-10-10 19:44, Kevin Grittner wrote: >> To restate: to do so is conflating the logical definition of the >> database with a particular implementation detail. As just one >> reason that is a bad idea: we can look up unique indexes on the >> specified columns, but if we implement a other storage techniques >> where there is no such thing as a unique index on the columns, yet >> manage to duplicate the semantics (yes, stranger things have >> happened), people can't migrate to the new structure without >> rewriting their queries > > Wouldn't it be good enough to define the 'WITHIN' as expecting a > unique-constraint name rather than an index name (even though those > happen to be the same strings)? > > I think constraints are part of the logical definition of the database, > and a new storage technique which doesn't use indexes should still have > names for its unique constraints. What about partial indexes? Indexes on expressions or functions calls? .marko
On 2014-10-12 13:40, Marko Tiikkaja wrote: > On 10/12/14, 2:36 PM, Matthew Woodcraft wrote: >> On 2014-10-10 19:44, Kevin Grittner wrote: >>> To restate: to do so is conflating the logical definition of the >>> database with a particular implementation detail. As just one >>> reason that is a bad idea: we can look up unique indexes on the >>> specified columns, but if we implement a other storage techniques >>> where there is no such thing as a unique index on the columns, yet >>> manage to duplicate the semantics (yes, stranger things have >>> happened), people can't migrate to the new structure without >>> rewriting their queries >> >> Wouldn't it be good enough to define the 'WITHIN' as expecting a >> unique-constraint name rather than an index name (even though those >> happen to be the same strings)? >> >> I think constraints are part of the logical definition of the database, >> and a new storage technique which doesn't use indexes should still have >> names for its unique constraints. > > What about partial indexes? Indexes on expressions or functions calls? On this theory, you'd be allowed to use them with 'WITHIN' (or whatever it would be called) if and when PostgreSQL gains the ability to create and manage them using a form of the CONSTRAINT clause. -M-
On Fri, Oct 10, 2014 at 4:41 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Oct 10, 2014 at 12:09 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think what's realistic here is that the patch isn't going to get >> committed the way you have it today, because nobody else likes that >> design. That may be harsh, but I think it's accurate. > > I do think that's harsh, because it's unnecessary: I am looking for > the best design. If you want to propose alternatives, great, but there > is a reason why I've done things that way, and that should be > acknowledged. I too think naming the unique index is ugly as sin, and > have said as much multiple times. We're almost on the same page here. Come on. You've had the syntax that way for a while, multiple people have objected to it, and it didn't get changed. When people renewed their objections, you said that they were not having a "realistic conversation". I'm tired of getting critiqued because there's some fine point of one of the scores of emails you've sent on this topic that you feel I haven't adequately appreciated. I would like to avoid getting into a flame-war here where we spend a lot of time arguing about who should have said what in exactly what way, but I think it is fair to say that your emails have come across to me, and to a number of other people here, as digging in your heels and refusing to budge. I would go so far as to say that said perception is the primary reason why this patch is making so little progress, far outstripping whatever the actual technical issues are. >> Whatever the actual behavior >> of your patch is today, it seems to me that the conflict is, >> fundamentally, over a set of column values, NOT over a particular >> index. The index is simply a mechanism for noticing that conflict. > > I think that this is the kernel of our disagreement. The index is not > simply a mechanism for noticing that conflict. The user had better > have one unique index in mind to provoke taking the alternative path > in the event of a would-be unique violation. Clearly it doesn't matter > much in this particular example. But what if there were two partial > unique indexes, that were actually distinct, but only in terms of the > constant appearing in their predicate? And what about having that > changed by a before insert row-level trigger? Are we to defer deciding > which unique index to use at the last possible minute? I don't immediately see why this would cause a problem. IIUC, the scenario we're talking about is: create table foo (a int, b int); create unique index foo1 on foo (a) where b = 1; create unique index foo2 on foo (a) where b = 2; If the user issues INSERT .. ON DUPLICATE (a) UPDATE, we'll fire before-insert triggers and then inspect the tuple to be inserted. If b is neither 1 nor 2, then we'll fail with an error saying that we can't support ON DUPLICATE without a relevant index to enforce uniqueness. (This can presumably re-use the same error message that would have popped out if the user done ON DUPLICATE (b), which is altogether un-indexed.) But if b is 1 or 2, then we'll search the matching index for a conflicting tuple; finding none, we'll insert; finding one, we'll do the update as per the user's instructions. > As always with this stuff, the devil is in the details. If we work out > the details, then I can come up with something that's acceptable to > everyone. Would you be okay with this never working with partial > unique indexes? That gives me something to work with. If it can't be made to work with them in a reasonable fashion, I would probably be OK with that, but so far I see no reason for such pessimism. >>>> Also, how about making the SET clause optional, with the semantics >>>> that we just update all of the fields for which a value is explicitly >>>> specified: >>>> >>>> INSERT INTO overwrite_with_abandon (key, value) >>>> VALUES (42, 'meaning of life') >>>> ON DUPLICATE (key) UPDATE; > >> Your syntax allows the exact same thing; it simply require the user to >> be more verbose in order to get that behavior. If we think that >> people wanting that behavior will be rare, then it's fine to require >> them to spell it out when they want it. If we think it will be the >> overwhelming common application of this feature, and I do, then making >> people spell it out when we could just as well infer it is pointless. > > Did you consider my example? I think that people will like this idea, > too - that clearly isn't the only consideration, though. As you say, > it would be very easy to implement this. However, IMV, we shouldn't, > because it is hazardous. MySQL doesn't allow this, and they tend to > find expedients like this useful. I'm considering your points in this area as well as I can, but as far as I can tell you haven't actually set what's bad about it, just that it might be hazardous in some way that you don't appear to have specified, and that MySQL doesn't allow it. I am untroubled by the latter point; it is not our goal to confine our implementation to a subset of MySQL. But it would be fine to leave this kind of shortcut out of the initial patch. If other people agree that it's useful, it will get re-proposed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Oct 13, 2014 at 7:46 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Come on. You've had the syntax that way for a while, multiple people > have objected to it, and it didn't get changed. When people renewed > their objections, you said that they were not having a "realistic > conversation". I'm tired of getting critiqued because there's some > fine point of one of the scores of emails you've sent on this topic > that you feel I haven't adequately appreciated. We're in the fine point business. Obviously the issues with partial unique indexes *are* pretty esoteric. But worrying about these edge cases are the kind of thing that will get us an implementation of high enough quality to commit. They're esoteric until they affect you. > I would like to avoid > getting into a flame-war here where we spend a lot of time arguing > about who should have said what in exactly what way, but I think it is > fair to say that your emails have come across to me, and to a number > of other people here, as digging in your heels and refusing to budge. I am not refusing to budge (in point of fact, on this question I have already budged; see my remarks below). I am saying: Hey, there is a reason why you're getting push back here. The reason isn't that I like "WITHIN unique_index_name" so much - obviously I don't. Who could? > I don't immediately see why this would cause a problem. IIUC, the > scenario we're talking about is: > > create table foo (a int, b int); > create unique index foo1 on foo (a) where b = 1; > create unique index foo2 on foo (a) where b = 2; > > If the user issues INSERT .. ON DUPLICATE (a) UPDATE, we'll fire > before-insert triggers and then inspect the tuple to be inserted. If > b is neither 1 nor 2, then we'll fail with an error saying that we > can't support ON DUPLICATE without a relevant index to enforce > uniqueness. (This can presumably re-use the same error message that > would have popped out if the user done ON DUPLICATE (b), which is > altogether un-indexed.) But if b is 1 or 2, then we'll search the > matching index for a conflicting tuple; finding none, we'll insert; > finding one, we'll do the update as per the user's instructions. Before row insert triggers might invalidate that conclusion at the last possible moment. So you'd have to recheck, which is just messy. >> As always with this stuff, the devil is in the details. If we work out >> the details, then I can come up with something that's acceptable to >> everyone. Would you be okay with this never working with partial >> unique indexes? That gives me something to work with. > > If it can't be made to work with them in a reasonable fashion, I would > probably be OK with that, but so far I see no reason for such > pessimism. At the very least I think that it would be a bad use of our resources to bend over backwards to make this work for partial unique indexes, which is what it would take. So I will proceed with the basic idea of naming columns, while not having that ever resolve partial unique indexes. > I'm considering your points in this area as well as I can, but as far > as I can tell you haven't actually set what's bad about it, just that > it might be hazardous in some way that you don't appear to have > specified, and that MySQL doesn't allow it. I am untroubled by the > latter point; it is not our goal to confine our implementation to a > subset of MySQL. I did - several times. I linked to the discussion [1]. I said "There is a trade-off here. I am willing to go another way in that trade-off, but let's have a realistic conversation about it". And Kevin eventually said of not supporting partial unique indexes: "That seems like the only sensible course, to me". At which point I agreed to do it that way [2]. So you've already won this argument. All it took was looking at my reasons for doing things that way from my perspective. If there has been digging of heals, you should consider that it might be for a reason, even if on balance you still disagree with my conclusion (which was clearly not really a firm conclusion in this instance anyway). That's all I mean here. I have been successful at talking you out of various approaches to this problem multiple times. This is not simple intransigence. [1] http://www.postgresql.org/message-id/CAM3SWZQpGf6_ME6D1vWqdCFadD7Nprdx2JrE=Hcf=93KXeHY4g@mail.gmail.com [2] http://www.postgresql.org/message-id/CAM3SWZStdChN6-ieJbc20OGD8TwmZ6-um6O8Gz2BOFzXn9YFVg@mail.gmail.com -- Peter Geoghegan
On Mon, Oct 13, 2014 at 2:02 PM, Peter Geoghegan <pg@heroku.com> wrote: >> If the user issues INSERT .. ON DUPLICATE (a) UPDATE, we'll fire >> before-insert triggers and then inspect the tuple to be inserted. If >> b is neither 1 nor 2, then we'll fail with an error saying that we >> can't support ON DUPLICATE without a relevant index to enforce >> uniqueness. (This can presumably re-use the same error message that >> would have popped out if the user done ON DUPLICATE (b), which is >> altogether un-indexed.) But if b is 1 or 2, then we'll search the >> matching index for a conflicting tuple; finding none, we'll insert; >> finding one, we'll do the update as per the user's instructions. > > Before row insert triggers might invalidate that conclusion at the > last possible moment. So you'd have to recheck, which is just messy. I can't imagine that you'd decide which index to use and then change your mind when you turn out to be wrong. I think rather you'd compute a list of possibly-applicable indexes based on the ON DUPLICATE column list, and then, after firing before-insert triggers, check whether there's one that will definitely work. If there's a non-partial unique index on the relevant columns, then you can put any single such index into the list of possibly-usable indexes and leave the rest out; otherwise, you include all the candidates and pick between them at runtime. If that seems too complicated, leave it out for v1: just insist that there must be at least one unique non-partial index on the relevant set of columns. >> I'm considering your points in this area as well as I can, but as far >> as I can tell you haven't actually set what's bad about it, just that >> it might be hazardous in some way that you don't appear to have >> specified, and that MySQL doesn't allow it. I am untroubled by the >> latter point; it is not our goal to confine our implementation to a >> subset of MySQL. > > I did - several times. I linked to the discussion [1]. I said "There > is a trade-off here. I am willing to go another way in that trade-off, > but let's have a realistic conversation about it". And Kevin > eventually said of not supporting partial unique indexes: "That seems > like the only sensible course, to me". At which point I agreed to do > it that way [2]. So you've already won this argument. All it took was > looking at my reasons for doing things that way from my perspective. > If there has been digging of heals, you should consider that it might > be for a reason, even if on balance you still disagree with my > conclusion (which was clearly not really a firm conclusion in this > instance anyway). That's all I mean here. There seems to be some confusion here. This part was about this syntax: >>>> INSERT INTO overwrite_with_abandon (key, value) >>>> VALUES (42, 'meaning of life') >>>> ON DUPLICATE (key) UPDATE; That's a different issue from naming indexes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Oct 16, 2014 at 6:48 AM, Robert Haas <robertmhaas@gmail.com> wrote: > If that seems too complicated, leave it out for v1: just insist that > there must be at least one unique non-partial index on the relevant > set of columns. That's what I'll do. > There seems to be some confusion here. This part was about this syntax: > >>>>> INSERT INTO overwrite_with_abandon (key, value) >>>>> VALUES (42, 'meaning of life') >>>>> ON DUPLICATE (key) UPDATE; > > That's a different issue from naming indexes. It is? In any case, I'm working on a revision with this syntax: postgres=# insert into upsert values (1, 'Foo') on conflict (key) update set val = conflicting(val); INSERT 0 1 postgres=# insert into upsert values (1, 'Foo') on conflict (val) update set val = conflicting(val); ERROR: 42P10: could not infer which unique index to use from expressions/columns provided for ON CONFLICT LINE 1: insert into upsert values (1, 'Foo') on conflict (val) updat... ^ HINT: Partial unique indexes are not supported LOCATION: transformConflictClause, parse_clause.c:2365 Expression indexes work fine with this syntax. I want to retain CONFLICTING(), although I'm thinking about changing the spelling to EXCLUDED(). While CONFLICTING() is more or less a new and unprecedented style of expression, and in general that's something to be skeptical of, I think it's appropriate because what we want here isn't quite like any existing expression. Using an alias-like syntax is misleading, since it implies that are no effects carried from the firing of before row insert triggers. It's also trickier to implement alias-like referencing. This is not a join, and I think suggesting that it is by using an alias-like syntax to refer to excluded rows proposed for insertion from the UPDATE is a mistake. -- Peter Geoghegan
On Thu, Oct 16, 2014 at 11:01 AM, Peter Geoghegan <pg@heroku.com> wrote: > It is? In any case, I'm working on a revision with this syntax: By the way, in my next revision, barring any objections, the ON CONFLICT (column/expression) syntax is mandatory in the case of ON CONFLICT UPDATE, and optional in the case of ON CONFLICT IGNORE. If we end up supporting exclusion constraints, it's pretty clear that they're only useful with ON CONFLICT IGNORE anyway, and so I guess we don't have to worry about naming those. I guess there would be some advantage to naming an exclusion constraint directly even for the IGNORE case (which is another reason I considered naming an index directly), but it doesn't seem that important. -- Peter Geoghegan
On Thu, Oct 16, 2014 at 2:01 PM, Peter Geoghegan <pg@heroku.com> wrote: > I want to retain CONFLICTING(), although I'm thinking about changing > the spelling to EXCLUDED(). While CONFLICTING() is more or less a new > and unprecedented style of expression, and in general that's something > to be skeptical of, I think it's appropriate because what we want here > isn't quite like any existing expression. Using an alias-like syntax > is misleading, since it implies that are no effects carried from the > firing of before row insert triggers. It's also trickier to implement > alias-like referencing. I think that the general consensus was against that style. I don't like it, and IIRC a few other people commented as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Oct 16, 2014 at 11:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Oct 16, 2014 at 2:01 PM, Peter Geoghegan <pg@heroku.com> wrote: >> I want to retain CONFLICTING(), although I'm thinking about changing >> the spelling to EXCLUDED(). While CONFLICTING() is more or less a new >> and unprecedented style of expression, and in general that's something >> to be skeptical of, I think it's appropriate because what we want here >> isn't quite like any existing expression. Using an alias-like syntax >> is misleading, since it implies that are no effects carried from the >> firing of before row insert triggers. It's also trickier to implement >> alias-like referencing. > > I think that the general consensus was against that style. I don't > like it, and IIRC a few other people commented as well. I think that's accurate, but I'd ask those that didn't like the style to reconsider. Also, I am willing to use any type of special expression, and any keyword or keywords. It doesn't have to be function-like at all. But I believe that an alias-like syntax presents certain difficulties. There is the fact that this isn't a join, and so misleads users, which I've explained. But if I have to add a range table entry for the alias to make parse analysis of the UPDATE work in the more or less conventional way (which is something I like a lot about the current design), then that creates considerable headaches. I have to discriminate against those in the optimizer, when time comes to disallow params in the auxiliary plan. I have to think about each case then, and I think that could add a lot more complexity than you'd think. I'm not really sure. Basically, it's difficult to make this work for technical reasons precisely because what I have here isn't join-like. Can I easily disallow "OLD.*" in a RETURNING clause (recall that we only project inserted tuples, as always)? Even if you think that's okay, I'd rather give an error message indicating that that makes no sense, which is what happens right now. Besides all that, there may also be an advantage to having something similar to MySQL. -- Peter Geoghegan
On Thu, Oct 16, 2014 at 8:00 PM, Peter Geoghegan <pg@heroku.com> wrote: > Basically, it's difficult to make this work for technical reasons > precisely because what I have here isn't join-like. Can I easily > disallow "OLD.*" in a RETURNING clause (recall that we only project > inserted tuples, as always)? Even if you think that's okay, I'd rather > give an error message indicating that that makes no sense, which is > what happens right now. Well OLD and NEW are also not joins yet we expose them this way. It always seemed like a hack to me but better one hack than two different inconsistent hacks, no? Independent of all that though. CONFLICTING() isn't clear to me -- conflicting is a relative property of two or more objects which are both (or all) conflicting with each other. Which one does "CONFLICTING" refer to here? -- greg
On Sun, Oct 19, 2014 at 2:52 PM, Greg Stark <stark@mit.edu> wrote: > Well OLD and NEW are also not joins yet we expose them this way. It > always seemed like a hack to me but better one hack than two different > inconsistent hacks, no? In my opinion, no. Those hacks do not appear in the parse analysis of ordinary optimizable statements, only utility statements concerning rules. ChangeVarNodes() does actual rewriting of magic Vars for stored rules during rewriting, much later. Within the parser code's directory, only transformRuleStmt() knows about the magic RTE entries appearing in rules (PRS2_OLD_VARNO and PRS2_NEW_VARNO), and that's in the file parse_utilcmd.c. OLD.* and NEW.* are dynamically expanded placeholders that only exist for the purposes of stored rules. You can see why at the very least it's a PITA to teach more or less ordinary parse analysis - my new invocation of transformUpdateStmt() - to care about that. Note that there are hardly any changes to transformUpdateStmt() in the patch, and as I've said, I like that a lot. Even with stored rules, we only see ChangeVarNodes() changing those OLD.*/NEW.* placeholders (that have dummy relation RTEs) to refer to actual, existing RTEs (I think that these are typically added by rule invocation, but I'm not an expert on the rewriter). We're not talking about creating a new RTE artificially during ordinary parse analysis of conventional/optimizable statements. In the patch, parse analysis thinks that CONFLICTING(my_var) is actually an ordinary reference to target.my_var that happens to involve ConflictingExpr, which I think is appropriate. Making available the post-before-insert-row-triggers tuple is driven by new code called by ExecInsert(). The executor drives all this. All I require from the planner is a dead simple sequential scan based update plan, that I can use the EvalPlanQual() infrastructure with (no more, no less), that doesn't attempt to apply any optimization (or refer to something external) that isn't compatible with how the plan needs to be used here. I want to ensure that the optimizer gives me only that, and magic RTEs that need to last until execution are bad news, in my (admittedly non-expert) opinion. I also happen to think that the two situations (stored rules versus CONFLICTING()) are fundamentally dissimilar from a user's perspective, as I've said. > Independent of all that though. CONFLICTING() isn't clear to me -- > conflicting is a relative property of two or more objects which are > both (or all) conflicting with each other. Which one does > "CONFLICTING" refer to here? I think you're right about that. I thought about changing the name to "EXCLUDED()", which reflects that the tuple is the actual tuple excluded from insertion. In particular, a version that carries forward all of the effects of before row insert triggers. What do you think of that? -- Peter Geoghegan