Обсуждение: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

Поиск
Список
Период
Сортировка

Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

От
Peter Geoghegan
Дата:
Someone suggested to me that I solicit opinion on the chosen syntax of
INSERT...ON DUPLICATE KEY LOCK FOR UPDATE on a separate thread. I'll
do that here, while also giving a formal user-level definition of the
feature. I'd like to solicit feedback from a wider set of people than
those participating in the main thread, while avoiding talking about
arcane details around locking which have dominated discussions up
until this point.

The patch proposed adds an optional ON DUPLICATE KEY LOCK FOR UPDATE
clause. It also extends the RETURNING clause to support projecting
"REJECTS".

Idiomatic wCTE usage, through which it's possible to build an
"UPSERT", might look like:

CREATE TABLE foo
( a int4 PRIMARY KEY, b int4, c text
);

WITH rej AS (
INSERT INTO foo (a, b, c)
-- Multiple rows proposed for insertion - may insert or update each
VALUES (44, 1078, 'insert'), (55, 1088, 'insert')
ON DUPLICATE KEY LOCK FOR UPDATE
RETURNING REJECTS a)
UPDATE foo SET c = 'update' FROM rej WHERE foo.a = rej.a;

This has what I like to call the fundamental upsert property: You
always either get an insert, or an update. People are already
incorrectly using wCTEs like this; I thought I'd make it work
correctly.

I believe that we'll be well served by the flexibility of what I've
proposed. In many respects it is similar to SQL MERGE. For example, I
could have deleted rather than updated, or I could have had a more
complex predicate in the update clause, so that updates did not occur
on all rejecting rows. While I'm not particularly attached to the
syntax, I do think that the flexibility is a key strength.

I recently suggested that rather than RETURNING REJECTS, we could have
a REJECTING clause, which would see a DML statement project strictly
the complement of what RETURNING projects in the same context. So
perhaps you could also see what RETURNING would not have projected
because a before row trigger returned NULL (i.e. when a before trigger
indicates to not proceed with insertion). That is certainly more
general, and so is perhaps preferable. It's also less verbose, and it
seems less likely to matter that we'll need to make REJECTING a fully
reserved keyword, as compared to REJECTS. (RETURNING is already a
fully reserved keyword not described by the standard, so this makes a
certain amount of sense to me). If nothing else, REJECTING is more
terse than RETURNING REJECTS.

Like the MySQL feature "INSERT...ON DUPLICATE KEY UPDATE", duplicates
are defined as would-be violators of any and all unique indexes. Like
the MySQL feature, the user is generally obligated to make sure they
know ahead of time which unique index any violation may come from, or
else they may update the wrong row (i.e. the row they ultimately
update is not already locked). Unlike the MySQL feature, expert users
have some capacity to recover if that problem is anticipated, because
we also may project out the rejecting row's ctid.

Notably the wCTE pattern, with a ctid join condition doesn't work very
well, because, as src/backend/optimizer/path/tidpath.c says:
* There is currently no special support for joins involving CTID; in* particular nothing corresponding to
best_inner_indexscan().   Since it's* not very useful to store TIDs of one table in another table, there* doesn't seem
tobe enough use-case to justify adding a lot of code* for that.
 

You end up with a seqscan, not a tidscan, so I don't think every
novice user is going to try this as a premature optimization, without
appreciating the hazards of tid updates. Projecting the rejecting
row's tid is an expert level feature, added mostly with things like
multi-master replication conflict resolution in mind. Those use-cases
will find this feature quite important, and will particularly value
the flexibility. In fact, I think for that use-case, it's even more
useful than SQL MERGE for a couple of reasons, in particular the
capability to defer doing anything with the locked row until later
commands.

This feature is not supposed to fully satisfy those calling for SQL
MERGE. I anticipate that we'll still get MERGE in a future release,
and the implementation anticipates this as well.

More formally, what the feature does is:

* Ensure that a row is either inserted successfully, or that if an
effort to do so was unsuccessful, the first conclusively committed
tuple with a conflicting value is exclusive locked. Not all
conflicting tuples are locked, just the first, although the order in
which we check unique indexes for conflicts is well defined (at least
with my implementation, where when we find a conflicting TID, it must
really be the first one at that juncture, because all previous unique
indexes are value locked).

* Ensure that in READ COMMITTED mode, the locked row is always
visible. Since we can get a lock on the tuple blamed for insertion not
proceeding, it must be conclusively committed and not updated or
deleted by anyone else, but in and of itself that isn't sufficient.
The tuple's xact may logically be still-in-progress to our snapshot,
and it would not be acceptable to have it be impossible to update for
that reason. So there is a special case adjustment to the general
semantics of MVCC snapshots, just to facilitate this feature. Higher
isolation levels actively forbid themselves from proceeding when they
observe that they'll have to avail of this special rule to update, by
throwing a serialization failure.

-- 
Peter Geoghegan



Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

От
Craig Ringer
Дата:
On 01/12/2014 06:42 AM, Peter Geoghegan wrote:
> Someone suggested to me that I solicit opinion on the chosen syntax of
> INSERT...ON DUPLICATE KEY LOCK FOR UPDATE on a separate thread. I'll
> do that here, while also giving a formal user-level definition of the
> feature. I'd like to solicit feedback from a wider set of people than
> those participating in the main thread, while avoiding talking about
> arcane details around locking which have dominated discussions up
> until this point.

After reading the following trimmed text I can see that the idea behind
this solves a real user problem.

It's not beautiful - but we've been rejecting things that solve the
upsert problem for users for ages. This looks technically solid enough
to do the job, useful, and reasonably usable with guidance from the
documentation.

My main question is how (if) this interacts with COPY. Would users need
to continue to COPY to an UNLOGGED or TEMP table, then use this for an
upsert-like operation into the target table? Or is it likely that COPY
can be extended to expose similar functionality - or wrap this into a
copy-on-duplicate-key-update ?


-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

От
Andreas Karlsson
Дата:
On 01/11/2014 11:42 PM, Peter Geoghegan wrote:
> I recently suggested that rather than RETURNING REJECTS, we could have
> a REJECTING clause, which would see a DML statement project strictly
> the complement of what RETURNING projects in the same context. So
> perhaps you could also see what RETURNING would not have projected
> because a before row trigger returned NULL (i.e. when a before trigger
> indicates to not proceed with insertion). That is certainly more
> general, and so is perhaps preferable. It's also less verbose, and it
> seems less likely to matter that we'll need to make REJECTING a fully
> reserved keyword, as compared to REJECTS. (RETURNING is already a
> fully reserved keyword not described by the standard, so this makes a
> certain amount of sense to me). If nothing else, REJECTING is more
> terse than RETURNING REJECTS.

I do not entirely understand what you are proposing here.  Any example 
how this would look compared to your RETURNING REJECTS proposal?

-- 
Andreas Karlsson



Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

От
Peter Geoghegan
Дата:
On Sun, Jan 12, 2014 at 8:12 AM, Andreas Karlsson <andreas@proxel.se> wrote:
> On 01/11/2014 11:42 PM, Peter Geoghegan wrote:
>> I recently suggested that rather than RETURNING REJECTS, we could have
>> a REJECTING clause, which would see a DML statement project strictly
>> the complement of what RETURNING projects in the same context. So
>> perhaps you could also see what RETURNING would not have projected
>> because a before row trigger returned NULL (i.e. when a before trigger
>> indicates to not proceed with insertion). That is certainly more
>> general, and so is perhaps preferable. It's also less verbose, and it
>> seems less likely to matter that we'll need to make REJECTING a fully
>> reserved keyword, as compared to REJECTS. (RETURNING is already a
>> fully reserved keyword not described by the standard, so this makes a
>> certain amount of sense to me). If nothing else, REJECTING is more
>> terse than RETURNING REJECTS.
>
> I do not entirely understand what you are proposing here.  Any example how
> this would look compared to your RETURNING REJECTS proposal?

It's very similar - REJECTING is a total generalization of what I
already have. The difference is only that REJECTING is accepted in all
contexts that RETURNING is, and not just with INSERT...ON DUPLICATE
KEY LOCK FOR UPDATE. So you could potentially have REJECTING project
the slot proposed for insertion on an UPDATE where RETURNING would
not. If for example a BEFORE ROW trigger fired, and returned NULL,
perhaps it'd then be possible to project the slot as it was before
being passed to the trigger. Perhaps there is no real demand for that,
but, as I said, from a usability perspective it may be easier to
reason about a feature that projects strictly the complement of what
RETURNING would project in the same context.

-- 
Peter Geoghegan



Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

От
Andreas Karlsson
Дата:
On 01/12/2014 11:20 PM, Peter Geoghegan wrote:
> On Sun, Jan 12, 2014 at 8:12 AM, Andreas Karlsson <andreas@proxel.se> wrote:
>> On 01/11/2014 11:42 PM, Peter Geoghegan wrote:
>>> I recently suggested that rather than RETURNING REJECTS, we could have
>>> a REJECTING clause, which would see a DML statement project strictly
>>> the complement of what RETURNING projects in the same context. So
>>> perhaps you could also see what RETURNING would not have projected
>>> because a before row trigger returned NULL (i.e. when a before trigger
>>> indicates to not proceed with insertion). That is certainly more
>>> general, and so is perhaps preferable. It's also less verbose, and it
>>> seems less likely to matter that we'll need to make REJECTING a fully
>>> reserved keyword, as compared to REJECTS. (RETURNING is already a
>>> fully reserved keyword not described by the standard, so this makes a
>>> certain amount of sense to me). If nothing else, REJECTING is more
>>> terse than RETURNING REJECTS.
>>
>> I do not entirely understand what you are proposing here.  Any example how
>> this would look compared to your RETURNING REJECTS proposal?
>
> It's very similar - REJECTING is a total generalization of what I
> already have. The difference is only that REJECTING is accepted in all
> contexts that RETURNING is, and not just with INSERT...ON DUPLICATE
> KEY LOCK FOR UPDATE. So you could potentially have REJECTING project
> the slot proposed for insertion on an UPDATE where RETURNING would
> not. If for example a BEFORE ROW trigger fired, and returned NULL,
> perhaps it'd then be possible to project the slot as it was before
> being passed to the trigger. Perhaps there is no real demand for that,
> but, as I said, from a usability perspective it may be easier to
> reason about a feature that projects strictly the complement of what
> RETURNING would project in the same context.

So simply this?

WITH rej AS (    INSERT INTO foo (a, b, c)    VALUES (44, 1078, 'insert'), (55, 1088, 'insert')    REJECTING a)
UPDATE foo SET c = 'update' FROM rej WHERE foo.a = rej.a;

Another question: have you given any thought on the case where you want 
to use both the successfully inserted tuples and the rejected and use in 
the CTE? Is that even something anyone would want? Would perhaps MERGE 
be more suited for that?

-- 
Andreas Karlsson



Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

От
Jim Nasby
Дата:
On 1/12/14, 9:35 PM, Andreas Karlsson wrote:
> On 01/12/2014 11:20 PM, Peter Geoghegan wrote:
>> On Sun, Jan 12, 2014 at 8:12 AM, Andreas Karlsson <andreas@proxel.se> wrote:
>>> On 01/11/2014 11:42 PM, Peter Geoghegan wrote:
>>>> I recently suggested that rather than RETURNING REJECTS, we could have
>>>> a REJECTING clause, which would see a DML statement project strictly
>>>> the complement of what RETURNING projects in the same context. So
>>>> perhaps you could also see what RETURNING would not have projected
>>>> because a before row trigger returned NULL (i.e. when a before trigger
>>>> indicates to not proceed with insertion). That is certainly more
>>>> general, and so is perhaps preferable. It's also less verbose, and it
>>>> seems less likely to matter that we'll need to make REJECTING a fully
>>>> reserved keyword, as compared to REJECTS. (RETURNING is already a
>>>> fully reserved keyword not described by the standard, so this makes a
>>>> certain amount of sense to me). If nothing else, REJECTING is more
>>>> terse than RETURNING REJECTS.
>>>
>>> I do not entirely understand what you are proposing here.  Any example how
>>> this would look compared to your RETURNING REJECTS proposal?
>>
>> It's very similar - REJECTING is a total generalization of what I
>> already have. The difference is only that REJECTING is accepted in all
>> contexts that RETURNING is, and not just with INSERT...ON DUPLICATE
>> KEY LOCK FOR UPDATE. So you could potentially have REJECTING project
>> the slot proposed for insertion on an UPDATE where RETURNING would
>> not. If for example a BEFORE ROW trigger fired, and returned NULL,
>> perhaps it'd then be possible to project the slot as it was before
>> being passed to the trigger. Perhaps there is no real demand for that,
>> but, as I said, from a usability perspective it may be easier to
>> reason about a feature that projects strictly the complement of what
>> RETURNING would project in the same context.
>
> So simply this?
>
> WITH rej AS (
>      INSERT INTO foo (a, b, c)
>      VALUES (44, 1078, 'insert'), (55, 1088, 'insert')
>      REJECTING a)
> UPDATE foo SET c = 'update' FROM rej WHERE foo.a = rej.a;
>
> Another question: have you given any thought on the case where you want to use both the successfully inserted tuples
andthe rejected and use in the CTE? Is that even something anyone would want? Would perhaps MERGE be more suited for
that?

Well, a common case for INSERT RETURNING is to get your set of surrogate keys back; so I think users would want the
abilityto RETURN what finally made it into the table.
 

Also, if we want to support the case of identifying tuples where a BEFORE trigger disallowed the insert, we probably
wantto expose that that's why those tuples were rejected (as opposed to them being rejected due to a duplicate key
violation).
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

От
Peter Geoghegan
Дата:
On Mon, Jan 13, 2014 at 2:20 PM, Jim Nasby <jim@nasby.net> wrote:
> Well, a common case for INSERT RETURNING is to get your set of surrogate
> keys back; so I think users would want the ability to RETURN what finally
> made it into the table.

Your update can also have a RETURNING clause. I'm not necessarily that
attached to fully generalizing RETURNING REJECTS as REJECTING. It was
just an idea. When an insert is rejected and you lock a conflicting
row, it hardly matters what your surrogate key might have been had
that insert succeeded.

To get the surrogate key when it upsert inserts, do a regular
INSERT....RETURNING..., and break the work up into multiple commands.
That will almost always be sufficient, because you'll almost always
know ahead of time where the conflict might be (certainly, the MySQL
feature mandates that you do know).

> Also, if we want to support the case of identifying tuples where a BEFORE
> trigger disallowed the insert, we probably want to expose that that's why
> those tuples were rejected (as opposed to them being rejected due to a
> duplicate key violation).

The ctid *won't* indicate a specific rejecting row then, I guess,
which will do it.

-- 
Peter Geoghegan



Re: Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

От
Jim Nasby
Дата:
On 1/13/14, 5:05 PM, Peter Geoghegan wrote:
> On Mon, Jan 13, 2014 at 2:20 PM, Jim Nasby <jim@nasby.net> wrote:
>> Well, a common case for INSERT RETURNING is to get your set of surrogate
>> keys back; so I think users would want the ability to RETURN what finally
>> made it into the table.
>
> Your update can also have a RETURNING clause. I'm not necessarily that
> attached to fully generalizing RETURNING REJECTS as REJECTING. It was
> just an idea. When an insert is rejected and you lock a conflicting
> row, it hardly matters what your surrogate key might have been had
> that insert succeeded.
>
> To get the surrogate key when it upsert inserts, do a regular
> INSERT....RETURNING..., and break the work up into multiple commands.
> That will almost always be sufficient, because you'll almost always
> know ahead of time where the conflict might be (certainly, the MySQL
> feature mandates that you do know).

As long as there's a way to get back what was ultimately inserted or updated that'd work... there might be some cases
whereyou'd actually want to know what the result of the REJECTING command was (ie: did the update do something
fancy?).

Actually, you'd also want to know if triggers did anything. So we definitely want to keep the existing RETURNING
behavior(sorry, I don't know offhand if you've kept that or not).
 

>> Also, if we want to support the case of identifying tuples where a BEFORE
>> trigger disallowed the insert, we probably want to expose that that's why
>> those tuples were rejected (as opposed to them being rejected due to a
>> duplicate key violation).
>
> The ctid *won't* indicate a specific rejecting row then, I guess,
> which will do it.

Yeah, the only other thing you might want is the name of the trigger that returned NULL... that would allow you to do
somethingdifferent based on which trigger it was.
 

Part of me thinks that'd be useful... part of me thinks it's just a foot-gun...
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net