Обсуждение: Removing duplicates
I have used this method successfully on another table but this one is not working.
I have a table of nearly 800 million records with some duplicates in.select rart_id, r9, ra, ry, rw, rv, rp, rs, ri
from isi.rcited_ref
where rart_id = 'A1986D733500019';
renders a result of 72 records. When I do
select count(id), rart_id, r9, ra, ry, rw, rv, rp, rs, ri
from isi.rcited_ref
where rart_id = 'A1986D733500019'
group by rart_id, r9, ra, ry, rw, rv, rp, rs, ri;
select count(id), rart_id, r9, ra, ry, rw, rv, rp, rs, ri
from isi.rcited_ref
where rart_id = 'A1986D733500019'
group by rart_id, r9, ra, ry, rw, rv, rp, rs, ri;
It shows that there are 36 duplicates with this rart_id.
So as a test I did the following (the id-field is the primary key):
DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
WHERE q1.id != q.id
AND q.rart_id = q1.rart_id
AND q.r9 = q1.r9
AND q.ra = q1.ra
AND q.ry = q1.ry
AND q.rw = q1.rw
AND q.rv = q1.rv
AND q.rp = q1.rp
AND q.rs = q1.rs
AND q.rart_id = 'A1986D733500019'
);
DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
WHERE q1.id != q.id
AND q.rart_id = q1.rart_id
AND q.r9 = q1.r9
AND q.ra = q1.ra
AND q.ry = q1.ry
AND q.rw = q1.rw
AND q.rv = q1.rv
AND q.rp = q1.rp
AND q.rs = q1.rs
AND q.rart_id = 'A1986D733500019'
);
But that deletes none. And I cannot see what went wrong.
I have also tried the same query with ctid without success:
DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
WHERE q1.ctid < q.ctid
AND q.rart_id = q1.rart_id
AND q.r9 = q1.r9
AND q.ra = q1.ra
AND q.ry = q1.ry
AND q.rw = q1.rw
AND q.rv = q1.rv
AND q.rp = q1.rp
AND q.rs = q1.rs
AND q.rart_id = 'A1986D733500019'
);
DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
WHERE q1.ctid < q.ctid
AND q.rart_id = q1.rart_id
AND q.r9 = q1.r9
AND q.ra = q1.ra
AND q.ry = q1.ry
AND q.rw = q1.rw
AND q.rv = q1.rv
AND q.rp = q1.rp
AND q.rs = q1.rs
AND q.rart_id = 'A1986D733500019'
);
The size of the table makes it difficult to use a 'group by' method to delete all duplcates.
What am I doing wrong?
Regards
Johann
Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you. (Psalm 63:3)
Because experiencing your loyal love is better than life itself,
my lips will praise you. (Psalm 63:3)
Dear Johann,
I tried (with PostgreSQL 9.2) to run the two DELETE statements you describe in your mail (the first based on the "id" field, the second on the ctid) and they work! I have to point out that if you use the DELETE based on the "id" field YOU'LL DELETE ALL RECORDS having at least one duplicate.
The "q1.id != q.id" doesn't work because query doesn't know yet its result so every row with a duplicate will match without exception.
If you use the DELETE based on ctid order you delete only records with the same "rart_id", keeping the record with the lowest "id".
Remember that if you run the first DELETE query you remove more data than you expect.
Regards,
Giuseppe.
Il 10/07/2013 09:11, Johann Spies ha scritto:
I tried (with PostgreSQL 9.2) to run the two DELETE statements you describe in your mail (the first based on the "id" field, the second on the ctid) and they work! I have to point out that if you use the DELETE based on the "id" field YOU'LL DELETE ALL RECORDS having at least one duplicate.
The "q1.id != q.id" doesn't work because query doesn't know yet its result so every row with a duplicate will match without exception.
If you use the DELETE based on ctid order you delete only records with the same "rart_id", keeping the record with the lowest "id".
Remember that if you run the first DELETE query you remove more data than you expect.
Regards,
Giuseppe.
Il 10/07/2013 09:11, Johann Spies ha scritto:
I have used this method successfully on another table but this one is not working. I have a table of nearly 800 million records with some duplicates in. Here is an example: select rart_id, r9, ra, ry, rw, rv, rp, rs, ri from isi.rcited_ref where rart_id = 'A1986D733500019'; renders a result of 72 records. When I do select count(id), rart_id, r9, ra, ry, rw, rv, rp, rs, ri from isi.rcited_ref where rart_id = 'A1986D733500019' group by rart_id, r9, ra, ry, rw, rv, rp, rs, ri; It shows that there are 36 duplicates with this rart_id. So as a test I did the following (the id-field is the primary key): DELETE FROM isi.rcited_ref q WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1 WHERE q1.id != q.id AND q.rart_id = q1.rart_id AND q.r9 = q1.r9 AND q.ra = q1.ra AND q.ry = q1.ry AND q.rw = q1.rw AND q.rv = q1.rv AND q.rp = q1.rp AND q.rs = q1.rs AND q.rart_id = 'A1986D733500019' ); But that deletes none. And I cannot see what went wrong. I have also tried the same query with ctid without success: DELETE FROM isi.rcited_ref q WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1 WHERE q1.ctid < q.ctid AND q.rart_id = q1.rart_id AND q.r9 = q1.r9 AND q.ra = q1.ra AND q.ry = q1.ry AND q.rw = q1.rw AND q.rv = q1.rv AND q.rp = q1.rp AND q.rs = q1.rs AND q.rart_id = 'A1986D733500019' ); The size of the table makes it difficult to use a 'group by' method to delete all duplcates. What am I doing wrong? Regards Johann -- Because experiencing your loyal love is better than life itself, my lips will praise you. (Psalm 63:3)
Hi,
You can try following query on test database to remove duplicates.
DELETE FROM isi.rcited_ref a
WHERE a.ctid <> (SELECT min(ctid) FROM isi.rcited_ref b
WHERE a.rart_id = b.rart_id
AND a.r9 = b.r9
AND a.ra = b.ra
AND a.ry = b.ry
AND a.rw = b.rw
AND a.rv = b.rv
AND a.rp = b.rp
AND a.rs = b.rs
AND a.rart_id = 'A1986D733500019'
);
Regards,
Ketana
From: Johann Spies <johann.spies@gmail.com>
To: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Wednesday, 10 July 2013 5:11 AM
Subject: [GENERAL] Removing duplicates
I have used this method successfully on another table but this one is not working.
I have a table of nearly 800 million records with some duplicates in.select rart_id, r9, ra, ry, rw, rv, rp, rs, ri
from isi.rcited_ref
where rart_id = 'A1986D733500019';
renders a result of 72 records. When I do
select count(id), rart_id, r9, ra, ry, rw, rv, rp, rs, ri
from isi.rcited_ref
where rart_id = 'A1986D733500019'
group by rart_id, r9, ra, ry, rw, rv, rp, rs, ri;
select count(id), rart_id, r9, ra, ry, rw, rv, rp, rs, ri
from isi.rcited_ref
where rart_id = 'A1986D733500019'
group by rart_id, r9, ra, ry, rw, rv, rp, rs, ri;
It shows that there are 36 duplicates with this rart_id.
So as a test I did the following (the id-field is the primary key):
DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
WHERE q1.id != q.id
AND q.rart_id = q1.rart_id
AND q.r9 = q1.r9
AND q.ra = q1.ra
AND q.ry = q1.ry
AND q.rw = q1.rw
AND q.rv = q1.rv
AND q.rp = q1.rp
AND q.rs = q1.rs
AND q.rart_id = 'A1986D733500019'
);
DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
WHERE q1.id != q.id
AND q.rart_id = q1.rart_id
AND q.r9 = q1.r9
AND q.ra = q1.ra
AND q.ry = q1.ry
AND q.rw = q1.rw
AND q.rv = q1.rv
AND q.rp = q1.rp
AND q.rs = q1.rs
AND q.rart_id = 'A1986D733500019'
);
But that deletes none. And I cannot see what went wrong.
I have also tried the same query with ctid without success:
DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
WHERE q1.ctid < q.ctid
AND q.rart_id = q1.rart_id
AND q.r9 = q1.r9
AND q.ra = q1.ra
AND q.ry = q1.ry
AND q.rw = q1.rw
AND q.rv = q1.rv
AND q.rp = q1.rp
AND q.rs = q1.rs
AND q.rart_id = 'A1986D733500019'
);
DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
WHERE q1.ctid < q.ctid
AND q.rart_id = q1.rart_id
AND q.r9 = q1.r9
AND q.ra = q1.ra
AND q.ry = q1.ry
AND q.rw = q1.rw
AND q.rv = q1.rv
AND q.rp = q1.rp
AND q.rs = q1.rs
AND q.rart_id = 'A1986D733500019'
);
The size of the table makes it difficult to use a 'group by' method to delete all duplcates.
What am I doing wrong?
Regards
Johann
Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you. (Psalm 63:3)
Because experiencing your loyal love is better than life itself,
my lips will praise you. (Psalm 63:3)