Обсуждение: Finding rows in table T1 that DO NOT MATCH any row in table T2

Поиск
Список
Период
Сортировка

Finding rows in table T1 that DO NOT MATCH any row in table T2

От
Shaul Dar
Дата:
Hi,

I have two large tables T1 and T2, such that T2 has a FK to T1 (i.e. T2.FK --> T1.PK, possibly multiple T2 rows may reference the same T1 row). I have deleted about 2/3 of table T2. I now want to delete all rows in T1 that are not referenced by T2, i.e. all rows in T1 that cannot join with (any row in) T2 on the condition T2.FK = T1.PK (the opposite of a join...)

I assume this will work but will take a long time:

DELETE * FROM T1 where T1.PK NOT IN
(SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)

What is an efficient way to do this?
Thanks,

-- Shaul

Re: Finding rows in table T1 that DO NOT MATCH any row in table T2

От
Shaul Dar
Дата:
How about:

DELETE * FROM T1 LEFT JOIN T2 ON T1.PK = T2.FK
WHERE T2.FK IS NULL

Shaul


On Tue, Oct 20, 2009 at 2:37 PM, Shaul Dar <shauldar@gmail.com> wrote:
Hi,

I have two large tables T1 and T2, such that T2 has a FK to T1 (i.e. T2.FK --> T1.PK, possibly multiple T2 rows may reference the same T1 row). I have deleted about 2/3 of table T2. I now want to delete all rows in T1 that are not referenced by T2, i.e. all rows in T1 that cannot join with (any row in) T2 on the condition T2.FK = T1.PK (the opposite of a join...)

I assume this will work but will take a long time:

DELETE * FROM T1 where T1.PK NOT IN
(SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)

What is an efficient way to do this?
Thanks,

-- Shaul


Re: Finding rows in table T1 that DO NOT MATCH any row in table T2

От
"A. Kretschmer"
Дата:
In response to Shaul Dar :
> Hi,
>
> I have two large tables T1 and T2, such that T2 has a FK to T1 (i.e. T2.FK -->
> T1.PK, possibly multiple T2 rows may reference the same T1 row). I have deleted
> about 2/3 of table T2. I now want to delete all rows in T1 that are not
> referenced by T2, i.e. all rows in T1 that cannot join with (any row in) T2 on
> the condition T2.FK = T1.PK (the opposite of a join...)
>
> I assume this will work but will take a long time:
>
> DELETE * FROM T1 where T1.PK NOT IN
> (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)
>
> What is an efficient way to do this?
> Thanks,

Maybe this one:

(my id is your pk):

delete from t1 where t1.id in (select t1.id from t1 left join t2 using
(id) where t2.id is null);

Try it, and/or use explain for both versions and see which which is
faster.

Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)

Re: Finding rows in table T1 that DO NOT MATCH any row in table T2

От
Tom Lane
Дата:
Shaul Dar <shauldar@gmail.com> writes:
> I assume this will work but will take a long time:

> DELETE * FROM T1 where T1.PK NOT IN
> (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)

Well, yeah, but it's unnecessarily inefficient --- why not just

DELETE FROM T1 where T1.PK NOT IN
(SELECT T2.FK FROM T2)

However, that still won't be tremendously fast unless the subselect fits
in work_mem.  As of 8.4 this variant should be reasonable:

DELETE FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)

Pre-8.4 you should resort to the "left join where is null" trick,
but there's no need to be so obscure as of 8.4.

            regards, tom lane

Re: Finding rows in table T1 that DO NOT MATCH any row in table T2

От
Melton Low
Дата:
How about

DELETE FROM T1 WHERE T1.PK IN
(SELECT T1.PK FROM T1 EXCEPT SELECT T2.FK FROM T2);

Mel

On Tue, Oct 20, 2009 at 7:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Shaul Dar <shauldar@gmail.com> writes:
> I assume this will work but will take a long time:

> DELETE * FROM T1 where T1.PK NOT IN
> (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)

Well, yeah, but it's unnecessarily inefficient --- why not just

DELETE FROM T1 where T1.PK NOT IN
(SELECT T2.FK FROM T2)

However, that still won't be tremendously fast unless the subselect fits
in work_mem.  As of 8.4 this variant should be reasonable:

DELETE FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)

Pre-8.4 you should resort to the "left join where is null" trick,
but there's no need to be so obscure as of 8.4.

                       regards, tom lane

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Finding rows in table T1 that DO NOT MATCH any row in table T2

От
Shaul Dar
Дата:
Tom,

1. Actually I just tested you suggestion

SELECT COUNT (*) FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)

and in worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it took 1m8s,

My suggestion, i.e.

SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK = T2.FK
WHERE T2.FK IS NULL

was about twice as fast, 37s. (both returned same number of rows, about 2/3 of T1)

However I can use DELETE with your version (instead of "SELECT COUNT (*)" above) but not with mine (can't have LEFT JOIN in DELETE), so YOU WIN. Thanks!

2. BTW. I presented my question earlier in an overly simplified fashion. Sorry. In actuality the two tables are joined on two columns,
say Ka and Kb (a composite key column), e.g. T1.PKa = T2.FKa and T1.PKb = T2.FKb. So the IN versions suggested will not work
since AFAIK IN only works for a single value.

-- Shaul

On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Shaul Dar <shauldar@gmail.com> writes:
> I assume this will work but will take a long time:

> DELETE * FROM T1 where T1.PK NOT IN
> (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)

Well, yeah, but it's unnecessarily inefficient --- why not just

DELETE FROM T1 where T1.PK NOT IN
(SELECT T2.FK FROM T2)

However, that still won't be tremendously fast unless the subselect fits
in work_mem.  As of 8.4 this variant should be reasonable:

DELETE FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)

Pre-8.4 you should resort to the "left join where is null" trick,
but there's no need to be so obscure as of 8.4.

                       regards, tom lane

Re: Finding rows in table T1 that DO NOT MATCH any row in table T2

От
Scott Carey
Дата:
On 10/21/09 4:52 AM, "Shaul Dar" <shauldar@gmail.com> wrote:

> Tom,
>
> 1. Actually I just tested you suggestion
>
> SELECT COUNT (*) FROM T1 where NOT EXISTS
> (SELECT 1 FROM T2 where T1.PK <http://t1.pk/>  = T2.FK <http://t2.fk/> )
>
> and in worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it
> took 1m8s,
>
> My suggestion, i.e.
>
> SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK <http://t1.pk/>  = T2.FK
> <http://t2.fk/>
> WHERE T2.FK <http://t2.fk/>  IS NULL
>
> was about twice as fast, 37s. (both returned same number of rows, about 2/3 of
> T1)
>
> However I can use DELETE with your version (instead of "SELECT COUNT (*)"
> above) but not with mine (can't have LEFT JOIN in DELETE), so YOU WIN. Thanks!
>
> 2. BTW. I presented my question earlier in an overly simplified fashion.
> Sorry. In actuality the two tables are joined on two columns,
> say Ka and Kb (a composite key column), e.g. T1.PKa = T2.FKa and T1.PKb =
> T2.FKb. So the IN versions suggested will not work
> since AFAIK IN only works for a single value.

The performance will stink in many cases, but IN and NOT IN can work on
multiple values, for example:

WHERE (a.key1, a.key2) NOT IN (SELECT b.key1, b.key2 FROM b).

The fastest (in 8.4) is definitely NOT EXISTS.

WHERE NOT EXISTS (SELECT 1 FROM b WHERE (b.key1, b.key2) = (a.key1, a.key2))

  I've done this, deleting from tables with 15M + rows where I need a "not
in" on two or three columns on multiple other tables.
However, NOT EXISTS is only fast if every NOT EXISTS clause is a select on
one table, if it is multiple tables and a join, things can get ugly and the
planner might not optimize it right.  In that case use two NOT EXISTS
clauses.  Always look at the EXPLAIN plan.

With 8.4 -- for performance generally prefer the following:
* prefer JOIN and implicit joins to IN and EXISTS.
* prefer 'NOT EXISTS' to 'NOT IN' or  'LEFT JOIN where (right is null)'


>
> -- Shaul
>
> On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Shaul Dar <shauldar@gmail.com> writes:
>>> I assume this will work but will take a long time:
>>
>>> DELETE * FROM T1 where T1.PK <http://T1.PK>  NOT IN
>>> (SELECT T1.PK <http://T1.PK>  FROM T1, T2 where T1.PK <http://T1.PK>  =
>>> T2.FK <http://T2.FK> )
>>
>> Well, yeah, but it's unnecessarily inefficient --- why not just
>>
>> DELETE FROM T1 where T1.PK <http://T1.PK>  NOT IN
>> (SELECT T2.FK <http://T2.FK>  FROM T2)
>>
>> However, that still won't be tremendously fast unless the subselect fits
>> in work_mem.  As of 8.4 this variant should be reasonable:
>>
>> DELETE FROM T1 where NOT EXISTS
>> (SELECT 1 FROM T2 where T1.PK <http://T1.PK>  = T2.FK <http://T2.FK> )
>>
>> Pre-8.4 you should resort to the "left join where is null" trick,
>> but there's no need to be so obscure as of 8.4.
>>
>>                         regards, tom lane
>
>