Discussion: Deleting orphaned records to establish Ref Integrity
I have several large tables (10 million to 200 million rows) that have
foreign keys with each other by *convention*, but no actual FOREIGN KEY
constraints. Over the course of years, orphaned records (children with no
parent) have accumulated, and now I want to clean them up. I can't just
create the FK constraint, because the orphans cause violations.

Executing something like the following statement would work, but even with
indexes it takes an insane amount of time to execute for each of the tables:

DELETE FROM child_table WHERE parentid NOT IN
    (SELECT parentid FROM parent_table)

Are there any better ways to accomplish this task? I've brainstormed
extensively and searched the list archives to no avail. It goes without
saying that once the cleanup is done, FK constraints will be added so this
is never a mess again!

I am using PostgreSQL 7.4.6 on Linux, although I could restore these tables
on an 8.0.3 server if it would make things go faster!

Thanks,
Roman
"Roman F" <romanf@fusemail.com> writes: > ... Executing something like > the following statement would work, but even with indexes it takes an > insane amount of time to execute for each of the tables: > DELETE FROM child_table WHERE parentid NOT IN > (SELECT parentid FROM parent_table) Uh, what sort of query plan are you getting for that? PG 7.4 and up can do a reasonable job with NOT IN if the sub-select is small enough to fit into an in-memory hash table (of size sort_mem). I'm betting that your sort_mem setting is not high enough to encourage the planner to try the hash method. You could try increasing sort_mem ... but given the size of your tables, you might end up with a hash table large enough to drive the system into swapping, in which case it'll still be mighty slow. Another idea is to try an outer join: SELECT child_table.parentid INTO tmp_table FROM child_table LEFT JOIN parent_table ON (child_table.parentid = parent_table.parentid) WHERE parent_table.parentid IS NULL; which essentially does a join and then pulls out just the child_table rows that failed to match. This will probably end up getting done via a merge join or hybrid hash join, either of which are more scalable than the NOT IN code. You still have to do the actual deletions in child_table, but as long as there aren't too many, a NOT IN using tmp_table should work OK. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> "Roman F" <romanf@fusemail.com> writes:
>
> > DELETE FROM child_table WHERE parentid NOT IN
> >     (SELECT parentid FROM parent_table)
>
> Another idea is to try an outer join:
>
> SELECT child_table.parentid INTO tmp_table
>   FROM child_table LEFT JOIN parent_table
>     ON (child_table.parentid = parent_table.parentid)
>   WHERE parent_table.parentid IS NULL;

There's also

DELETE FROM child_table
 WHERE NOT EXISTS (SELECT 1
                     FROM parent_table
                    WHERE parentid = child_table.parentid)

which won't use anything as efficient as a hash join or merge join, but
will at least be capable of using index lookups for something basically
equivalent to a nested loop.

-- greg
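[Editor's note: a minimal runnable sketch of the correlated NOT EXISTS
delete, again using Python's sqlite3 as a stand-in for PostgreSQL; names and
data are invented for illustration. An index on parent_table.parentid (here
supplied by the PRIMARY KEY) is what lets each probe of the subquery be an
index lookup rather than a scan.]

```python
# Sketch of the one-pass correlated NOT EXISTS delete, via sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE parent_table (parentid INTEGER PRIMARY KEY);
    CREATE TABLE child_table  (childid  INTEGER PRIMARY KEY,
                               parentid INTEGER);
    INSERT INTO parent_table VALUES (1), (2);
    INSERT INTO child_table VALUES (10, 1), (11, 2), (12, 7);  -- 12 is an orphan
""")

# For each child row, probe parent_table; delete when no parent exists.
cur.execute("""
    DELETE FROM child_table
     WHERE NOT EXISTS (SELECT 1
                         FROM parent_table
                        WHERE parent_table.parentid = child_table.parentid)
""")
conn.commit()

survivors = [r[0] for r in cur.execute(
    "SELECT childid FROM child_table ORDER BY childid")]
print(survivors)  # [10, 11]
```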
>> Another idea is to try an outer join:
>>
>> SELECT child_table.parentid INTO tmp_table
>>   FROM child_table LEFT JOIN parent_table
>>     ON (child_table.parentid = parent_table.parentid)
>>   WHERE parent_table.parentid IS NULL;
>
> There's also
>
> DELETE FROM child_table
>  WHERE NOT EXISTS (SELECT 1
>                      FROM parent_table
>                     WHERE parentid = child_table.parentid)
>
> which won't use anything as efficient as a hash join or merge join, but
> will at least be capable of using index lookups for something basically
> equivalent to a nested loop.

Sorry for the delay, I only get to work on this system every now and then.

I tried Greg's suggestion and it worked out great. The estimates from
EXPLAIN were much larger than the actuals (e.g. 41 hours vs. 2 hours), so I
probably have some tuning to do with this dataset.

I decided not to try Tom's temp table method because I was afraid the
generated table would be very large, so the subsequent DELETE .. WHERE NOT
IN (...) would cause swapping again.

Thanks all for your help, your insights saved me a lot of headache.

Roman