Discussion: Allow DELETE to use ORDER BY and LIMIT/OFFSET


Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: Yugo NAGATA
Date: Fri, 17 Dec 2021 09:47:18 +0900
Hello hackers,

We cannot use ORDER BY or LIMIT/OFFSET in the current
DELETE statement syntax, so all the rows matching the
WHERE condition are deleted. However, the tuple-retrieving
process of DELETE is basically the same as that of a SELECT
statement, so I think we can also allow DELETE to use
ORDER BY and LIMIT/OFFSET.

Attached is the concept patch. This enables the following
operations:

================================================================
postgres=# select * from t order by i;
 i  
----
  1
  2
  2
  2
  2
  5
 10
 20
 33
 35
 53
(11 rows)

postgres=# delete from t where i = 2 limit 2;
DELETE 2
postgres=# select * from t order by i;
 i  
----
  1
  2
  2
  5
 10
 20
 33
 35
 53
(9 rows)

postgres=# delete from t order by i offset 3 limit 3;
DELETE 3
postgres=# select * from t order by i;
 i  
----
  1
  2
  2
 33
 35
 53
(6 rows)
================================================================

Although we can perform similar operations using ctid and a subquery,
such as

 DELETE FROM t WHERE ctid IN (SELECT ctid FROM t WHERE ... ORDER BY ... LIMIT ...),

it is more user-friendly and intuitive to allow this in the DELETE syntax,
because ctid is a system column and most users may not be familiar with it.
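
For instance, the first DELETE in the psql session above would have to
be written today roughly as follows (a sketch of the workaround, using
the same table t):

 -- Remove two of the duplicated i = 2 rows by targeting their ctids.
 DELETE FROM t
  WHERE ctid IN (SELECT ctid FROM t WHERE i = 2 LIMIT 2);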

Although this is not allowed in the SQL standard, it is supported
in MySQL[1]. DB2 also supports it although the syntax is somewhat
strange.[2]

Also, there seem to be some use cases. For example:
- when you want to delete a specified number of rows from a table
  that doesn't have a primary key and contains duplicated tuples;
- when you want to delete the bottom 10 items with bad scores
  (without using the rank() window function; a sketch of the current
  workaround follows the list);
- when you want to delete only some of the rows because it takes too
  long to delete all of them.
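
As a sketch of the current workaround for the second use case (assuming
a hypothetical scores table with columns player and score):

 -- Delete the bottom 10 scores by ranking the rows and targeting their ctids.
 -- (rank() keeps ties together, so this can remove more than 10 rows
 --  when scores tie at the cutoff.)
 DELETE FROM scores
  WHERE ctid IN (SELECT ctid
                   FROM (SELECT ctid,
                                rank() OVER (ORDER BY score ASC) AS rnk
                           FROM scores) AS ranked
                  WHERE rnk <= 10);

 -- With the proposed syntax this would simply be:
 -- DELETE FROM scores ORDER BY score ASC LIMIT 10;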

[1] https://dev.mysql.com/doc/refman/8.0/en/delete.html
[2] https://www.dba-db2.com/2015/04/delete-first-1000-rows-in-a-db2-table-using-fetch-first.html

What do you think?

-- 
Yugo NAGATA <nagata@sraoss.co.jp>


Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: Yugo NAGATA
Date:
On Fri, 17 Dec 2021 09:47:18 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:

> Hello hackers,
> 
> We cannot use ORDER BY or LIMIT/OFFSET in the current
> DELETE statement syntax, so all the rows matching the
> WHERE condition are deleted. However, the tuple-retrieving
> process of DELETE is basically the same as that of a SELECT
> statement, so I think we can also allow DELETE to use
> ORDER BY and LIMIT/OFFSET.
> 
> Attached is the concept patch. This enables the following
> operations:

After posting this, I noticed that there have been several similar
proposals in the past:

https://www.postgresql.org/message-id/flat/AANLkTi%3D6fBZh9yZT7f7kKh%2BzmQngAyHgZWBPM3eiEMj1%40mail.gmail.com
https://www.postgresql.org/message-id/flat/1393112801.59251.YahooMailNeo%40web163006.mail.bf1.yahoo.com
https://www.postgresql.org/message-id/flat/CADB9FDf-Vh6RnKAMZ4Rrg_YP9p3THdPbji8qe4qkxRuiOwm%3Dmg%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CALAY4q9fcrscybax7fg_uojFwjw_Wg0UMuSrf-FvN68SeSAPAA%40mail.gmail.com

Anyway, I'll review these threads before progressing with it.

> 
> ================================================================
> postgres=# select * from t order by i;
>  i  
> ----
>   1
>   2
>   2
>   2
>   2
>   5
>  10
>  20
>  33
>  35
>  53
> (11 rows)
> 
> postgres=# delete from t where i = 2 limit 2;
> DELETE 2
> postgres=# select * from t order by i;
>  i  
> ----
>   1
>   2
>   2
>   5
>  10
>  20
>  33
>  35
>  53
> (9 rows)
> 
> postgres=# delete from t order by i offset 3 limit 3;
> DELETE 3
> postgres=# select * from t order by i;
>  i  
> ----
>   1
>   2
>   2
>  33
>  35
>  53
> (6 rows)
> ================================================================
> 
> Although we can perform similar operations using ctid and a subquery,
> such as
> 
>  DELETE FROM t WHERE ctid IN (SELECT ctid FROM t WHERE ... ORDER BY ... LIMIT ...),
> 
> it is more user-friendly and intuitive to allow this in the DELETE syntax,
> because ctid is a system column and most users may not be familiar with it.
> 
> Although this is not allowed in the SQL standard, it is supported
> in MySQL[1]. DB2 also supports it although the syntax is somewhat
> strange.[2]
> 
> Also, there seem to be some use cases. For example:
> - when you want to delete a specified number of rows from a table
>   that doesn't have a primary key and contains duplicated tuples;
> - when you want to delete the bottom 10 items with bad scores
>   (without using the rank() window function);
> - when you want to delete only some of the rows because it takes too
>   long to delete all of them.
> 
> [1] https://dev.mysql.com/doc/refman/8.0/en/delete.html
> [2] https://www.dba-db2.com/2015/04/delete-first-1000-rows-in-a-db2-table-using-fetch-first.html
> 
> What do you think?
> 
> -- 
> Yugo NAGATA <nagata@sraoss.co.jp>


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: Tom Lane
Date: Thu, 16 Dec 2021 22:17:58 -0500
Yugo NAGATA <nagata@sraoss.co.jp> writes:
> We cannot use ORDER BY or LIMIT/OFFSET in the current
> DELETE statement syntax, so all the rows matching the
> WHERE condition are deleted. However, the tuple-retrieving
> process of DELETE is basically the same as that of a SELECT
> statement, so I think we can also allow DELETE to use
> ORDER BY and LIMIT/OFFSET.

Indeed, this is technically possible, but we've rejected the idea
before and I'm not aware of any reason to change our minds.
The problem is that a partial DELETE is not very deterministic
about which rows are deleted, and that does not seem like a
great property for a data-updating command.  (The same applies
to UPDATE, which is why we don't allow these options in that
command either.)  The core issues are:

* If the sort order is underspecified, or you omit ORDER BY
entirely, then it's not clear which rows will be operated on.
The LIMIT might stop after just some of the rows in a peer
group, and you can't predict which ones.

* UPDATE/DELETE necessarily involve the equivalent of SELECT
FOR UPDATE, which may cause the rows to be ordered more
surprisingly than you expected, ie the sort happens *before*
rows are replaced by their latest versions, which might have
different sort keys.

We live with this amount of indeterminism in SELECT, but that
doesn't make it a brilliant idea to allow it in UPDATE/DELETE.
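
To make the first point concrete with the table from the original post
(a sketch assuming the patch's proposed syntax):

 -- The table holds one row with i = 1 and four rows with i = 2.
 -- This has to delete the i = 1 row plus two of the four i = 2 peers,
 -- but nothing in the statement determines which two of them go.
 DELETE FROM t ORDER BY i LIMIT 3;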

            regards, tom lane



Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: "David G. Johnston"
Date: Thu, 16 Dec 2021 20:56:59 -0700
On Thursday, December 16, 2021, Yugo NAGATA <nagata@sraoss.co.jp> wrote:

> Also, there seem to be some use cases. For example:
> - when you want to delete a specified number of rows from a table
>   that doesn't have a primary key and contains duplicated tuples;

Not our problem…use the tools correctly; there is always the hack work-around for the few who didn’t.
 
> - when you want to delete the bottom 10 items with bad scores
>   (without using the rank() window function);

This one doesn’t make sense to me.

> - when you want to delete only some of the rows because it takes too
>   long to delete all of them.


This seems potentially compelling though I’d be more concerned about the memory aspects than simply taking a long amount of time.  If this is a problem then maybe discuss it without having a solution-in-hand?  But given the intense I/O cost that would happen, spreading this out over time seems acceptable, and it should be an infrequent thing to do.  Expecting users to plan and execute some custom code for their specific need seems reasonable.

So even if Tom’s technical concerns aren’t enough, working on this based upon these use cases doesn’t seem of high enough benefit.

David J.

Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: Yugo NAGATA
Date:
On Thu, 16 Dec 2021 22:17:58 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Yugo NAGATA <nagata@sraoss.co.jp> writes:
> > We cannot use ORDER BY or LIMIT/OFFSET in the current
> > DELETE statement syntax, so all the rows matching the
> > WHERE condition are deleted. However, the tuple-retrieving
> > process of DELETE is basically the same as that of a SELECT
> > statement, so I think we can also allow DELETE to use
> > ORDER BY and LIMIT/OFFSET.
> 
> Indeed, this is technically possible, but we've rejected the idea
> before and I'm not aware of any reason to change our minds.
> The problem is that a partial DELETE is not very deterministic
> about which rows are deleted, and that does not seem like a
> great property for a data-updating command.  (The same applies
> to UPDATE, which is why we don't allow these options in that
> command either.)  The core issues are:
> 
> * If the sort order is underspecified, or you omit ORDER BY
> entirely, then it's not clear which rows will be operated on.
> The LIMIT might stop after just some of the rows in a peer
> group, and you can't predict which ones.
> 
> * UPDATE/DELETE necessarily involve the equivalent of SELECT
> FOR UPDATE, which may cause the rows to be ordered more
> surprisingly than you expected, ie the sort happens *before*
> rows are replaced by their latest versions, which might have
> different sort keys.
> 
> We live with this amount of indeterminism in SELECT, but that
> doesn't make it a brilliant idea to allow it in UPDATE/DELETE.

Thank you for explaining it!
Now I understand why this idea is not considered good and has been rejected.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: Yugo NAGATA
Date:
On Thu, 16 Dec 2021 20:56:59 -0700
"David G. Johnston" <david.g.johnston@gmail.com> wrote:

> On Thursday, December 16, 2021, Yugo NAGATA <nagata@sraoss.co.jp> wrote:
> 
> >
> > Also, there seem to be some use cases. For example:
> > - when you want to delete a specified number of rows from a table
> >   that doesn't have a primary key and contains duplicated tuples;
> 
> 
> Not our problem…use the tools correctly; there is always the hack
> work-around for the few who didn’t.
> 
> 
> > - when you want to delete the bottom 10 items with bad scores
> >   (without using the rank() window function);
> 
> 
> This one doesn’t make sense to me.
> 
> > - when you want to delete only some of the rows because it takes too
> >   long to delete all of them.
> >
> >
> This seems potentially compelling though I’d be more concerned about the
> memory aspects than simply taking a long amount of time.  If this is a
> problem then maybe discuss it without having a solution-in-hand?  But given
> the intense I/O cost that would happen, spreading this out over time seems
> acceptable, and it should be an infrequent thing to do.  Expecting users to
> plan and execute some custom code for their specific need seems reasonable.
> 
> So even if Tom’s technical concerns aren’t enough, working on this based
> upon these use cases doesn’t seem of high enough benefit.

Thank you for your comments.
OK, I agree that the use cases are not so strong.

Regards,
Yugo Nagata


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: Greg Stark
Date: Fri, 17 Dec 2021 01:40:45 -0500
On Thu, 16 Dec 2021 at 22:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> * If the sort order is underspecified, or you omit ORDER BY
> entirely, then it's not clear which rows will be operated on.
> The LIMIT might stop after just some of the rows in a peer
> group, and you can't predict which ones.

Meh, that never seemed very compelling to me. I think that's on the
user and there are *lots* of cases where the user can easily know
enough extra context to know that's what she wants. In particular I've
often wanted to delete one of two identical records and it would have
been really easy to just do a DELETE LIMIT 1. I know there are ways to
do it but it's always seemed like unnecessary hassle when there was a
natural way to express it.

> * UPDATE/DELETE necessarily involve the equivalent of SELECT
> FOR UPDATE, which may cause the rows to be ordered more
> surprisingly than you expected, ie the sort happens *before*
> rows are replaced by their latest versions, which might have
> different sort keys.

This... is a real issue. Or is it? Just to be clear I think what
you're referring to is a case like:

INSERT INTO t values (1),(2)

In another session: BEGIN; UPDATE t set c=0 where c=2

DELETE FROM t ORDER BY c ASC LIMIT 1
<blocks>

In other session: COMMIT

Which row was deleted? In this case it was the row where c=1. Even
though the UPDATE reported success (i.e. 1 row updated) so presumably
it happened "before" the delete. The delete saw the ordering from
before it was blocked and saw the ordering with c=1, c=2 so apparently
it happened "before" the update.

There are plenty of other ways to see the same surprising
serialization failure. If instead of a DELETE you did an UPDATE there
are any number of functions or other features that have been added
which can expose the order in which the updates happened.

The way to solve those cases would be to use serializable isolation
(or even just repeatable read in this case). Wouldn't that also solve
the DELETE serialization failure too?

-- 
greg



Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: Yugo NAGATA
Date:
Hello Greg,

On Fri, 17 Dec 2021 01:40:45 -0500
Greg Stark <stark@mit.edu> wrote:

> On Thu, 16 Dec 2021 at 22:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > * If the sort order is underspecified, or you omit ORDER BY
> > entirely, then it's not clear which rows will be operated on.
> > The LIMIT might stop after just some of the rows in a peer
> > group, and you can't predict which ones.
> 
> Meh, that never seemed very compelling to me. I think that's on the
> user and there are *lots* of cases where the user can easily know
> enough extra context to know that's what she wants. In particular I've
> often wanted to delete one of two identical records and it would have
> been really easy to just do a DELETE LIMIT 1. I know there are ways to
> do it but it's always seemed like unnecessary hassle when there was a
> natural way to express it.

Out of curiosity, could you please tell me the concrete situations
where you wanted to delete one of two identical records?

> > * UPDATE/DELETE necessarily involve the equivalent of SELECT
> > FOR UPDATE, which may cause the rows to be ordered more
> > surprisingly than you expected, ie the sort happens *before*
> > rows are replaced by their latest versions, which might have
> > different sort keys.
> 
> This... is a real issue. Or is it? Just to be clear I think what
> you're referring to is a case like:
> 
> INSERT INTO t values (1),(2)
> 
> In another session: BEGIN; UPDATE t set c=0 where c=2
> 
> DELETE FROM t ORDER BY c ASC LIMIT 1
> <blocks>
> 
> In other session: COMMIT
> 
> Which row was deleted? In this case it was the row where c=1. Even
> though the UPDATE reported success (i.e. 1 row updated) so presumably
> it happened "before" the delete. The delete saw the ordering from
> before it was blocked and saw the ordering with c=1, c=2 so apparently
> it happened "before" the update.

When I tried it using my patch, the DELETE deleted the row where c=1,
the same as above, but it did not block. Is that result from an
experiment using my patch, or from another RDBMS such as MySQL?
 
> There are plenty of other ways to see the same surprising
> serialization failure. If instead of a DELETE you did an UPDATE there
> are any number of functions or other features that have been added
> which can expose the order in which the updates happened.
> 
> The way to solve those cases would be to use serializable isolation
> (or even just repeatable read in this case). Wouldn't that also solve
> the DELETE serialization failure too?

Do you mean that such serialization failures would be avoided in
SERIALIZABLE or REPEATABLE READ by aborting the transaction, while
they may be accepted in READ COMMITTED?

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

From: Corey Huinker
Date:
> Out of curiosity, could you please tell me the concrete situations
> where you wanted to delete one of two identical records?

In my case, there is a table with known duplicates, and we would like to delete all but the one with the lowest ctid, and then add a unique index to the table which then allows us to use INSERT ON CONFLICT in a meaningful way.
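
A sketch of that cleanup, assuming a hypothetical table dupes(k, v) in
which rows sharing the same k are duplicates (the ctid comparison relies
on the tid ordering operators available in recent releases):

 -- Keep only the row with the lowest ctid within each group of duplicates.
 DELETE FROM dupes a
  USING dupes b
  WHERE a.k = b.k
    AND a.ctid > b.ctid;

 -- A unique index can then be added, making INSERT ... ON CONFLICT usable.
 CREATE UNIQUE INDEX dupes_k_key ON dupes (k);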

The other need for a DELETE...LIMIT or UPDATE...LIMIT is when you're worried about flooding a replica, so you parcel out the DML into chunks that don't cause unacceptable lag on the replica.

Both of these can be accomplished via  DELETE FROM foo WHERE ctid IN ( SELECT ... FROM foo ... LIMIT 1000), but until recently such a construct would result in a full table scan, and you'd take the same hit with each subsequent DML.

I believe that the ctid range scan now can limit those scans, especially if you can order the limited set by ctid, but those techniques are not widely known at this time.
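
As a concrete sketch of that chunked pattern (hypothetical table foo and
filter some_condition; each statement removes at most 1000 rows, so it can
be repeated, with pauses, until it reports DELETE 0):

 -- Delete one batch of matching rows, bounding the work done per statement
 -- so that replicas are not flooded with changes all at once.
 DELETE FROM foo
  WHERE ctid IN (SELECT ctid
                   FROM foo
                  WHERE some_condition
                  LIMIT 1000);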