Обсуждение: Deleting orphaned records to establish Ref Integrity

Поиск
Список
Период
Сортировка

Deleting orphaned records to establish Ref Integrity

От
"Roman F"
Дата:
I have several large tables (10 million to 200 million rows) that have
foreign keys with each other by *convention*, but no actual FOREIGN KEY
constraints.

Over the course of years, orphaned records (children with no parent) have
accumulated and now I want to clean them up.  I can't just create the FK
constraint because the orphans cause violations.  Executing something like
the following statement would work, but even with indexes it takes an
insane amount of time to execute for each of the tables:

DELETE FROM child_table WHERE parentid NOT IN
  (SELECT parentid FROM parent_table)

Are there any better ways to accomplish this task?  I've brainstormed
extensively and searched the list archives to no avail.  It goes without
saying that once the cleanup is done, FK constraints will be added so this
is never a mess again!

I am using PostgreSQL 7.4.6 on Linux, although I could restore these
tables on an 8.0.3 server if it would make things go faster!

Thanks,
Roman

_____________________________________
Check All Email Accounts Anywhere!
Check your POP3 and webmail account
from any PC. With no ads
http://www.fusemail.com




_____________________________________
Consolidate your email!
http://www.fusemail.com


Re: Deleting orphaned records to establish Ref Integrity

От
Tom Lane
Дата:
"Roman F" <romanf@fusemail.com> writes:
> ...  Executing something like
> the following statement would work, but even with indexes it takes an
> insane amount of time to execute for each of the tables:

> DELETE FROM child_table WHERE parentid NOT IN
>   (SELECT parentid FROM parent_table)

Uh, what sort of query plan are you getting for that?

PG 7.4 and up can do a reasonable job with NOT IN if the sub-select is
small enough to fit into an in-memory hash table (of size sort_mem).
I'm betting that your sort_mem setting is not high enough to encourage
the planner to try the hash method.  You could try increasing sort_mem
... but given the size of your tables, you might end up with a hash
table large enough to drive the system into swapping, in which case
it'll still be mighty slow.

Another idea is to try an outer join:

    SELECT child_table.parentid INTO tmp_table
    FROM child_table LEFT JOIN parent_table
         ON (child_table.parentid = parent_table.parentid)
    WHERE parent_table.parentid IS NULL;

which essentially does a join and then pulls out just the child_table
rows that failed to match.  This will probably end up getting done via a
merge join or hybrid hash join, either of which are more scalable than
the NOT IN code.  You still have to do the actual deletions in
child_table, but as long as there aren't too many, a NOT IN using
tmp_table should work OK.

            regards, tom lane

Re: Deleting orphaned records to establish Ref Integrity

От
Greg Stark
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Roman F" <romanf@fusemail.com> writes:
>
> > DELETE FROM child_table WHERE parentid NOT IN
> >   (SELECT parentid FROM parent_table)
>
> Another idea is to try an outer join:
>
>     SELECT child_table.parentid INTO tmp_table
>     FROM child_table LEFT JOIN parent_table
>          ON (child_table.parentid = parent_table.parentid)
>     WHERE parent_table.parentid IS NULL;

There's also

DELETE
  FROM child_table
 WHERE NOT EXISTS (select 1
                     from parent_table
                    where parent_id = child_table.parent_id
                  )


Which won't use anything as efficient as a hash join or merge join but will be
at least capable of using index lookups for something basically equivalent to
a nested loop.



--
greg

Re: Deleting orphaned records to establish Ref Integrity

От
"Roman F"
Дата:
>> Another idea is to try an outer join:
>> SELECT child_table.parentid INTO tmp_table
>> FROM child_table LEFT JOIN parent_table
>> ON (child_table.parentid = parent_table.parentid)
>> WHERE parent_table.parentid IS NULL;
>
>There's also
>DELETE
>FROM child_table
>WHERE NOT EXISTS (select 1
>from parent_table
>where parent_id = child_table.parent_id
>)
>
> Which won't use anything as efficient as a hash join or merge join but
will beat
> least capable of using index lookups for something basically equivalent toa
> nested loop.

Sorry for the delay, I only get to work on this system every now and then.

I tried Greg's suggestion and it worked out great.  The estimates from
EXPLAIN were much larger than the actuals (e.g. 41 hours vs. 2 hours), so
I probably have some tuning to do with this dataset.  I decided not to try
Tom's temp table method because I was afraid the generated table would be
very large, so the subsequent DELETE .. WHERE NOT IN (...) would cause
swapping again.

Thanks all for your help, your insights saved me a lot of headache.

Roman


_____________________________________
Consolidate your email!
http://www.fusemail.com





_____________________________________
Check All Email Accounts Anywhere!
Check your POP3 and webmail account
from any PC. With no ads
http://www.fusemail.com