Обсуждение: SQL for Deleting all duplicate entries

Поиск
Список
Период
Сортировка

SQL for Deleting all duplicate entries

От
Håkan Jacobsson
Дата:
Hi,

I want to create a DELETE statement which deletes duplicates
in a table.

That is, I want to remove all rows - but one - having three
columns with the same data (more columns exist and there the
data varies).
For example:
column1
column2
column3
column4
column5

column2 = 'test', column3 = 'hey' and column4 IS NULL for
several rows in the table. I want to keep just one of those
rows.

Is this possible? I can't figure it out, so any help MUCH
appreciated!

/regards, Håkan Jacobsson - System developer at Relevant
Traffic, Sweden

Re: SQL for Deleting all duplicate entries

От
"Scott Marlowe"
Дата:
On 9/5/07, Håkan Jacobsson <hakan.jacobsson99@bredband.net> wrote:
> Hi,
>
> I want to create a DELETE statement which deletes duplicates
> in a table.
>
> That is, I want to remove all rows - but one - having three
> columns with the same data (more columns exist and there the
> data varies).

Assuming you've got a KNOWN unique id field (adding one if you don't)
you can do something like:

select * from table t1 join table t2 on (t1.field1=t2.field1 AND
t1.field2=t2.field2 AND t1.field3=t2.field3 AND t1.uid>t2.uid)

That should get the ids of all but one of the matching rows.  then
just use that in a subselect:

begin;
delete from table where uid in (select * from table t1 join table t2
on (t1.field1=t2.field1 AND t1.field2=t2.field2 AND
t1.field3=t2.field3 AND t1.uid>t2.uid) );
(check for dups / lost data)
commit;
or something like that.

Re: SQL for Deleting all duplicate entries

От
"Scott Marlowe"
Дата:
On 9/5/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> begin;
> delete from table where uid in (select * from table t1 join table t2
> on (t1.field1=t2.field1 AND t1.field2=t2.field2 AND
> t1.field3=t2.field3 AND t1.uid>t2.uid) );
> (check for dups / lost data)
> commit;


There's a bug up there ^^^

should be:

delete from table where uid in (select t1.uid from...

Re: SQL for Deleting all duplicate entries

От
"Merlin Moncure"
Дата:
On 9/5/07, Håkan Jacobsson <hakan.jacobsson99@bredband.net> wrote:
> Hi,
>
> I want to create a DELETE statement which deletes duplicates
> in a table.
>
> That is, I want to remove all rows - but one - having three
> columns with the same data (more columns exist and there the
> data varies).
> For example:
> column1
> column2
> column3
> column4
> column5
>
> column2 = 'test', column3 = 'hey' and column4 IS NULL for
> several rows in the table. I want to keep just one of those
> rows.
>
> Is this possible? I can't figure it out, so any help MUCH
> appreciated!

when removing duplicates, I find it is usually better to look at this
problem backwards...you want to select out the data you want to keep,
truncate the original table, and insert select the data back in.

What isn't exactly clear from your question is if you are interested
in only particular fields or if you want to throw out based on any
columns (nut just 2, 3, and 4).  If so, this is a highly irregular
(and interesting) problem, and should prove difficult to make
efficient.

If you are only interested in three particular columns, then it's easy.
1. select out data you want to keep using create table scratch SELECT
DISTINCT ON or GROUP BY into scratch
2. truncate main table
3. insert into main select * from scratch

for a more specific answer, you will have to provide some more detail,
especially regarding exactly how you determine two rows as being
'duplicates'.

merlin

Re: SQL for Deleting all duplicate entries

От
Håkan Jacobsson
Дата:
Merlin,

Its just about three columns - not any column. Two columns are
varchars and the third is
a date. The date column value is NULL  for the rows for which
I want to delete the duplicates.

Yes, please, be a bit more specific!

/regards, Håkan Jacobsson


>----Ursprungligt meddelande----
>Från: mmoncure@gmail.com
>Datum: 06-09-2007 01:56
>Till: "Håkan Jacobsson"<hakan.jacobsson99@bredband.net>
>Kopia: <pgsql-general@postgresql.org>
>Ärende: Re: [GENERAL] SQL for Deleting all duplicate entries
>
>On 9/5/07, Håkan Jacobsson <hakan.jacobsson99@bredband.net>
wrote:
>> Hi,
>>
>> I want to create a DELETE statement which deletes
duplicates
>> in a table.
>>
>> That is, I want to remove all rows - but one - having three
>> columns with the same data (more columns exist and there
the
>> data varies).
>> For example:
>> column1
>> column2
>> column3
>> column4
>> column5
>>
>> column2 = 'test', column3 = 'hey' and column4 IS NULL for
>> several rows in the table. I want to keep just one of those
>> rows.
>>
>> Is this possible? I can't figure it out, so any help MUCH
>> appreciated!
>
>when removing duplicates, I find it is usually better to look
at this
>problem backwards...you want to select out the data you want
to keep,
>truncate the original table, and insert select the data back
in.
>
>What isn't exactly clear from your question is if you are
interested
>in only particular fields or if you want to throw out based
on any
>columns (nut just 2, 3, and 4).  If so, this is a highly
irregular
>(and interesting) problem, and should prove difficult to make
>efficient.
>
>If you are only interested in three particular columns, then
it's easy.
>1. select out data you want to keep using create table
scratch SELECT
>DISTINCT ON or GROUP BY into scratch
>2. truncate main table
>3. insert into main select * from scratch
>
>for a more specific answer, you will have to provide some
more detail,
>especially regarding exactly how you determine two rows as
being
>'duplicates'.
>
>merlin
>



Re: SQL for Deleting all duplicate entries

От
"Merlin Moncure"
Дата:
On 9/9/07, Håkan Jacobsson <hakan.jacobsson99@bredband.net> wrote:
> Merlin,
>
> Its just about three columns - not any column. Two columns are
> varchars and the third is
> a date. The date column value is NULL  for the rows for which
> I want to delete the duplicates.


getting ready to go on vacation :).  The idea is you want to write a
query that pulls out the data you want to keep.   If you have a table
with 6 fields, f1 though f6 and you only want one record with
identical values of f1, f2, f3, you might do:

begin;
create temp table scratch as
  select f1, f2, f3, max(f4), max(f5), max(f6) from foo group by f1, f2, f3;

truncate foo;

insert into foo select * from scratch;
commit;

You can replace max() with any suitable aggregate you deem gets you
the best data out of the record.  If you are feeling really clever,
you can write a custom aggregate for the record type (it's easier than
you think!)

merlin

Re: SQL for Deleting all duplicate entries

От
Håkan Jacobsson
Дата:
Thanx Merlin, have a nice one (vacation)!

It turns out I have'nt described the problem accurately=(
Data may actually differ in two of the columns (the varchar
columns).
I still want to remove rows which share the same data in those
two columns and have the date column
set to NULL.
I.e. row 1,2,3  have:
      column1 = 'foo', column2 = 'hey' and the date column =
NULL
      row 4,5,6 have:
      column1 = 'brat', column2 = 'yo' and the date column =
NULL
I want to keep just one of the 1 - 3 rows and one of the 4 - 6
rows..

I will try Merlins and Scotts solutions tomorrow. Anyone know
if I need to modify Merlins and/or Scotts
solutions to solve this new situation?

/best regards, Håkan


>----Ursprungligt meddelande----
>Från: mmoncure@gmail.com
>Datum: 09-09-2007 15:42
>Till: "Håkan Jacobsson"<hakan.jacobsson99@bredband.net>
>Kopia: <pgsql-general@postgresql.org>
>Ärende: Re: [GENERAL] SQL for Deleting all duplicate entries
>
>On 9/9/07, Håkan Jacobsson <hakan.jacobsson99@bredband.net>
wrote:
>> Merlin,
>>
>> Its just about three columns - not any column. Two columns
are
>> varchars and the third is
>> a date. The date column value is NULL  for the rows for
which
>> I want to delete the duplicates.
>
>
>getting ready to go on vacation :).  The idea is you want to
write a
>query that pulls out the data you want to keep.   If you have
a table
>with 6 fields, f1 though f6 and you only want one record with
>identical values of f1, f2, f3, you might do:
>
>begin;
>create temp table scratch as
>  select f1, f2, f3, max(f4), max(f5), max(f6) from foo group
by f1, f2, f3;
>
>truncate foo;
>
>insert into foo select * from scratch;
>commit;
>
>You can replace max() with any suitable aggregate you deem
gets you
>the best data out of the record.  If you are feeling really
clever,
>you can write a custom aggregate for the record type (it's
easier than
>you think!)
>
>merlin
>



Re: SQL for Deleting all duplicate entries

От
brian
Дата:
Håkan Jacobsson wrote:
> Thanx Merlin, have a nice one (vacation)!
>
> It turns out I have'nt described the problem accurately=(
> Data may actually differ in two of the columns (the varchar
> columns).
> I still want to remove rows which share the same data in those
> two columns and have the date column
> set to NULL.
> I.e. row 1,2,3  have:
>       column1 = 'foo', column2 = 'hey' and the date column =
> NULL
>       row 4,5,6 have:
>       column1 = 'brat', column2 = 'yo' and the date column =
> NULL
> I want to keep just one of the 1 - 3 rows and one of the 4 - 6
> rows..
>
> I will try Merlins and Scotts solutions tomorrow. Anyone know
> if I need to modify Merlins and/or Scotts
> solutions to solve this new situation?
>

If i understand correctly, this should give you the records you want to
keep:

SELECT DISTINCT ON (t.one, t.two) t.one, t.two, t.three, [t.n] FROM foo
AS t;

Put those into a tmp table, truncate the original, then put the saved
rows back in.

brian