Обсуждение: Finding rows in table T1 that DO NOT MATCH any row in table T2
Hi,
I have two large tables T1 and T2, such that T2 has a FK to T1 (i.e. T2.FK --> T1.PK, possibly multiple T2 rows may reference the same T1 row). I have deleted about 2/3 of table T2. I now want to delete all rows in T1 that are not referenced by T2, i.e. all rows in T1 that cannot join with (any row in) T2 on the condition T2.FK = T1.PK (the opposite of a join...)
I assume this will work but will take a long time:
DELETE * FROM T1 where T1.PK NOT IN
(SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)
What is an efficient way to do this?
Thanks,
-- Shaul
I have two large tables T1 and T2, such that T2 has a FK to T1 (i.e. T2.FK --> T1.PK, possibly multiple T2 rows may reference the same T1 row). I have deleted about 2/3 of table T2. I now want to delete all rows in T1 that are not referenced by T2, i.e. all rows in T1 that cannot join with (any row in) T2 on the condition T2.FK = T1.PK (the opposite of a join...)
I assume this will work but will take a long time:
DELETE * FROM T1 where T1.PK NOT IN
(SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)
What is an efficient way to do this?
Thanks,
-- Shaul
How about:
DELETE * FROM T1 LEFT JOIN T2 ON T1.PK = T2.FK
WHERE T2.FK IS NULL
Shaul
DELETE * FROM T1 LEFT JOIN T2 ON T1.PK = T2.FK
WHERE T2.FK IS NULL
Shaul
On Tue, Oct 20, 2009 at 2:37 PM, Shaul Dar <shauldar@gmail.com> wrote:
Hi,
I have two large tables T1 and T2, such that T2 has a FK to T1 (i.e. T2.FK --> T1.PK, possibly multiple T2 rows may reference the same T1 row). I have deleted about 2/3 of table T2. I now want to delete all rows in T1 that are not referenced by T2, i.e. all rows in T1 that cannot join with (any row in) T2 on the condition T2.FK = T1.PK (the opposite of a join...)
I assume this will work but will take a long time:
DELETE * FROM T1 where T1.PK NOT IN
(SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)
What is an efficient way to do this?
Thanks,
-- Shaul
In response to Shaul Dar : > Hi, > > I have two large tables T1 and T2, such that T2 has a FK to T1 (i.e. T2.FK --> > T1.PK, possibly multiple T2 rows may reference the same T1 row). I have deleted > about 2/3 of table T2. I now want to delete all rows in T1 that are not > referenced by T2, i.e. all rows in T1 that cannot join with (any row in) T2 on > the condition T2.FK = T1.PK (the opposite of a join...) > > I assume this will work but will take a long time: > > DELETE * FROM T1 where T1.PK NOT IN > (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK) > > What is an efficient way to do this? > Thanks, Maybe this one: (my id is your pk): delete from t1 where t1.id in (select t1.id from t1 left join t2 using (id) where t2.id is null); Try it, and/or use explain for both versions and see which which is faster. Andreas -- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
Shaul Dar <shauldar@gmail.com> writes: > I assume this will work but will take a long time: > DELETE * FROM T1 where T1.PK NOT IN > (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK) Well, yeah, but it's unnecessarily inefficient --- why not just DELETE FROM T1 where T1.PK NOT IN (SELECT T2.FK FROM T2) However, that still won't be tremendously fast unless the subselect fits in work_mem. As of 8.4 this variant should be reasonable: DELETE FROM T1 where NOT EXISTS (SELECT 1 FROM T2 where T1.PK = T2.FK) Pre-8.4 you should resort to the "left join where is null" trick, but there's no need to be so obscure as of 8.4. regards, tom lane
How about
DELETE FROM T1 WHERE T1.PK IN
Mel
On Tue, Oct 20, 2009 at 7:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Shaul Dar <shauldar@gmail.com> writes:Well, yeah, but it's unnecessarily inefficient --- why not just
> I assume this will work but will take a long time:
> DELETE * FROM T1 where T1.PK NOT IN
> (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)
(SELECT T2.FK FROM T2)
However, that still won't be tremendously fast unless the subselect fits
in work_mem. As of 8.4 this variant should be reasonable:
DELETE FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)
Pre-8.4 you should resort to the "left join where is null" trick,
but there's no need to be so obscure as of 8.4.
regards, tom lane
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Tom,
1. Actually I just tested you suggestion
SELECT COUNT (*) FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)
and in worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it took 1m8s,
My suggestion, i.e.
SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK = T2.FK
WHERE T2.FK IS NULL
was about twice as fast, 37s. (both returned same number of rows, about 2/3 of T1)
However I can use DELETE with your version (instead of "SELECT COUNT (*)" above) but not with mine (can't have LEFT JOIN in DELETE), so YOU WIN. Thanks!
2. BTW. I presented my question earlier in an overly simplified fashion. Sorry. In actuality the two tables are joined on two columns,
say Ka and Kb (a composite key column), e.g. T1.PKa = T2.FKa and T1.PKb = T2.FKb. So the IN versions suggested will not work
since AFAIK IN only works for a single value.
-- Shaul
1. Actually I just tested you suggestion
SELECT COUNT (*) FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)
and in worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it took 1m8s,
My suggestion, i.e.
SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK = T2.FK
WHERE T2.FK IS NULL
was about twice as fast, 37s. (both returned same number of rows, about 2/3 of T1)
However I can use DELETE with your version (instead of "SELECT COUNT (*)" above) but not with mine (can't have LEFT JOIN in DELETE), so YOU WIN. Thanks!
2. BTW. I presented my question earlier in an overly simplified fashion. Sorry. In actuality the two tables are joined on two columns,
say Ka and Kb (a composite key column), e.g. T1.PKa = T2.FKa and T1.PKb = T2.FKb. So the IN versions suggested will not work
since AFAIK IN only works for a single value.
-- Shaul
On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Shaul Dar <shauldar@gmail.com> writes:Well, yeah, but it's unnecessarily inefficient --- why not just
> I assume this will work but will take a long time:
> DELETE * FROM T1 where T1.PK NOT IN
> (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)
(SELECT T2.FK FROM T2)
However, that still won't be tremendously fast unless the subselect fits
in work_mem. As of 8.4 this variant should be reasonable:
DELETE FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)
Pre-8.4 you should resort to the "left join where is null" trick,
but there's no need to be so obscure as of 8.4.
regards, tom lane
On 10/21/09 4:52 AM, "Shaul Dar" <shauldar@gmail.com> wrote: > Tom, > > 1. Actually I just tested you suggestion > > SELECT COUNT (*) FROM T1 where NOT EXISTS > (SELECT 1 FROM T2 where T1.PK <http://t1.pk/> = T2.FK <http://t2.fk/> ) > > and in worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it > took 1m8s, > > My suggestion, i.e. > > SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK <http://t1.pk/> = T2.FK > <http://t2.fk/> > WHERE T2.FK <http://t2.fk/> IS NULL > > was about twice as fast, 37s. (both returned same number of rows, about 2/3 of > T1) > > However I can use DELETE with your version (instead of "SELECT COUNT (*)" > above) but not with mine (can't have LEFT JOIN in DELETE), so YOU WIN. Thanks! > > 2. BTW. I presented my question earlier in an overly simplified fashion. > Sorry. In actuality the two tables are joined on two columns, > say Ka and Kb (a composite key column), e.g. T1.PKa = T2.FKa and T1.PKb = > T2.FKb. So the IN versions suggested will not work > since AFAIK IN only works for a single value. The performance will stink in many cases, but IN and NOT IN can work on multiple values, for example: WHERE (a.key1, a.key2) NOT IN (SELECT b.key1, b.key2 FROM b). The fastest (in 8.4) is definitely NOT EXISTS. WHERE NOT EXISTS (SELECT 1 FROM b WHERE (b.key1, b.key2) = (a.key1, a.key2)) I've done this, deleting from tables with 15M + rows where I need a "not in" on two or three columns on multiple other tables. However, NOT EXISTS is only fast if every NOT EXISTS clause is a select on one table, if it is multiple tables and a join, things can get ugly and the planner might not optimize it right. In that case use two NOT EXISTS clauses. Always look at the EXPLAIN plan. With 8.4 -- for performance generally prefer the following: * prefer JOIN and implicit joins to IN and EXISTS. * prefer 'NOT EXISTS' to 'NOT IN' or 'LEFT JOIN where (right is null)' > > -- Shaul > > On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Shaul Dar <shauldar@gmail.com> writes: >>> I assume this will work but will take a long time: >> >>> DELETE * FROM T1 where T1.PK <http://T1.PK> NOT IN >>> (SELECT T1.PK <http://T1.PK> FROM T1, T2 where T1.PK <http://T1.PK> = >>> T2.FK <http://T2.FK> ) >> >> Well, yeah, but it's unnecessarily inefficient --- why not just >> >> DELETE FROM T1 where T1.PK <http://T1.PK> NOT IN >> (SELECT T2.FK <http://T2.FK> FROM T2) >> >> However, that still won't be tremendously fast unless the subselect fits >> in work_mem. As of 8.4 this variant should be reasonable: >> >> DELETE FROM T1 where NOT EXISTS >> (SELECT 1 FROM T2 where T1.PK <http://T1.PK> = T2.FK <http://T2.FK> ) >> >> Pre-8.4 you should resort to the "left join where is null" trick, >> but there's no need to be so obscure as of 8.4. >> >> regards, tom lane > >