Discussion: Removing duplicate records from a bulk upload
I have just completed the bulk upload of a large database. Some tables have billions of records and no constraints or indexes have been applied yet. About 0.1% of these records may have been duplicated during the upload and I need to remove them before applying constraints.
I understand there are (at least) two approaches to get a table without duplicate records…
- Delete duplicate records from the table based on an appropriate select clause;
- Create a new table with the results from a select distinct clause, and then drop the original table (rough sketches of both options are below).
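For illustration only, here is roughly what each option could look like; the table name bigtable and the column id (assumed to identify a record) are placeholders, and neither statement is tuned for a billion-row table:

    -- Approach 1: delete the extra copies in place, keeping one row per id.
    -- ctid is the row's physical location, so no OID or unique column is needed.
    delete from bigtable
    where ctid in (
        select ctid
        from  (select ctid,
                      row_number() over (partition by id) as rn
               from   bigtable) numbered
        where  rn > 1
    );

    -- Approach 2: rebuild the table from a de-duplicated copy, then swap it in
    -- (works best when duplicated rows are identical in every column).
    create table bigtable_dedup as
        select distinct * from bigtable;
    drop table bigtable;
    alter table bigtable_dedup rename to bigtable;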
What would be the most efficient procedure in PostgreSQL to do the job considering …
- I do not know which records were duplicated;
- There are no indexes applied to the tables yet;
- There are no OIDs on the tables yet;
- The database is currently 1TB but I have plenty of disk space.
Daniel
On 12/7/2014 9:31 PM, Daniel Begin wrote:
> I have just completed the bulk upload of a large database. Some tables have billions of records and no constraints or indexes have been applied yet. About 0.1% of these records may have been duplicated during the upload and I need to remove them before applying constraints. [...]

How would you detect a duplicate? Is there a single field that would be duplicated, or do you have to test a bunch of different fields?

If it's a single field, you could find the dups in a single pass of the table with:

create index bigtable_key on bigtable (key);
select key, count(*) from bigtable group by key having count(*) > 1;

Save that list, and decide on some way of deleting the dups.

The index might help the initial select, but it will really help the re-query and delete statements.

-Andy
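As a sketch of the "save that list and delete" part, using the same placeholder names bigtable and key from the example above, and a hypothetical temp table dup_keys (the statement keeps an arbitrary one of the copies per key):

    -- save the list of duplicated key values
    create temp table dup_keys as
        select key
        from   bigtable
        group  by key
        having count(*) > 1;

    -- delete the extra physical copies of those keys, keeping one row each;
    -- ctid identifies the physical row, so no unique column is required
    delete from bigtable
    where ctid in (
        select ctid
        from  (select ctid,
                      row_number() over (partition by key) as rn
               from   bigtable
               where  key in (select key from dup_keys)) numbered
        where  rn > 1
    );

The index on key should make both the dup_keys build and the inner scan cheaper; on billions of rows it is worth checking the plan with explain before running the delete.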
On 12/8/2014 10:30 AM, Andy Colson wrote:
> If it's a single field, you could find the dups in a single pass of the table with:
>
> create index bigtable_key on bigtable (key);
> select key, count(*) from bigtable group by key having count(*) > 1;
>
> Save that list, and decide on some way of deleting the dups. [...]

I just thought of a more generic way:

1) make a non-unique index on bigtable;
2) make a temp table;
3) copy only the dups: insert into temp select * from bigtable where (it is a duplicate);
4) delete from bigtable where key in (select key from temp);
5) insert into bigtable select distinct * from temp;

This would minimize the amount of data you have to move around; it depends on how hard step 3 is to write. The index is not required, but it would help make both step 3 and step 4 faster.

-Andy
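Spelled out, the five steps might look roughly like this, again assuming the duplicates can be identified by a single column key and that duplicated rows are identical in every column (bigtable and the temp table name dups are placeholders):

    -- 1) non-unique index so the duplicate search and the delete can use it
    create index bigtable_key on bigtable (key);

    -- 2) and 3) make a temp table holding only the duplicated rows
    create temp table dups as
        select * from bigtable
        where  key in (select key
                       from   bigtable
                       group  by key
                       having count(*) > 1);

    -- 4) remove every copy of those keys from the big table
    delete from bigtable
    where  key in (select key from dups);

    -- 5) put back exactly one copy of each duplicated row
    insert into bigtable
    select distinct * from dups;

Since only about 0.1% of the rows are expected to be duplicated, the temp table should stay small, which is what keeps the amount of data moved to a minimum.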