Discussion: Slow duplicate deletes
I have a large table (20M records) but mostly short text fields. There are
duplicates that I am trying to remove. I have a bigserial index that is
unique, but there are duplicates in another field.
I have an 8 core, 12GB memory computer with RAID disks.
This request has been running for 70 hours (is it safe to kill it?).
How can I make this run faster? This is a one time processing task, but it
is taking a long time.
DELETE FROM portal.metadata
WHERE idx NOT IN
(
SELECT MIN(idx)
FROM portal.metadata
GROUP BY "data_object.unique_id"
);
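A likely reason the NOT IN form is this slow: when the subquery result is
too large to be hashed within work_mem, the planner falls back to rescanning
the materialized subquery for every one of the 20M rows. You can check
without running the delete:

EXPLAIN
DELETE FROM portal.metadata
WHERE idx NOT IN
(
    SELECT MIN(idx)
    FROM portal.metadata
    GROUP BY "data_object.unique_id"
);
-- "Filter: (NOT (hashed SubPlan 1))" is the tolerable case; a plain
-- "SubPlan" without "hashed" means a rescan per row

SHOW work_mem;
-- raising it for the session (e.g. SET work_mem = '512MB') may let the
-- planner use the hashed form; size it to your available RAM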
CREATE TABLE
metadata
(
"data_object.unique_id" CHARACTER(64) NOT NULL,
size_bytes BIGINT,
object_date TIMESTAMP(6) WITHOUT TIME ZONE,
object_type CHARACTER VARYING(25),
classification CHARACTER VARYING(7),
object_source CHARACTER VARYING(50),
object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
clevel INTEGER,
fsctlh CHARACTER VARYING(50),
oname CHARACTER VARYING(40),
description CHARACTER VARYING(80),
horizontal_datum CHARACTER VARYING(20),
do_location CHARACTER VARYING(200),
elevation_ft INTEGER,
location USER-DEFINED, -- the catalog reports the type only as USER-DEFINED
idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass) NOT NULL,
bbox CHARACTER VARYING(160),
CONSTRAINT MetaDataKey PRIMARY KEY (idx)
)
Actually, I have 24GB, but my question remains: how can I speed this up?
And can I kill the current SQL command (it is running in pgAdmin)?
On Mon, Mar 5, 2012 at 9:17 AM, DrYSG <ygutfreund@draper.com> wrote:
> I have a large table (20M records) but mostly short text fields. There are
> duplicates that I am trying to remove. I have a bigserial index that is
> unique, but there are duplicates in another field.
>
> I have an 8 core, 12GB memory computer with RAID disks.
>
> This request has been running for 70 hours (is it safe to kill it?).
Yes... generally speaking, it's safe to kill just about any query in
postgres at any time.
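If you'd rather cancel it cleanly than kill the process, you can do it from
another session. A minimal sketch (the pid and query column names assume
PostgreSQL 9.2 or later; on 9.1 and earlier they are procpid and
current_query):

-- find the backend running the long delete
SELECT pid, query
FROM pg_stat_activity
WHERE query LIKE 'DELETE FROM portal.metadata%';

-- ask that backend to cancel its current query (like Ctrl+C in psql)
SELECT pg_cancel_backend(12345);  -- replace 12345 with the pid from above

Rollback in postgres is cheap: aborting just marks the transaction rolled
back, so the cancel itself should return quickly.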
> How can I make this run faster? This is a one time processing task, but it
> is taking a long time.
>
> DELETE FROM portal.metadata
> WHERE idx NOT IN
> (
> SELECT MIN(idx)
> FROM portal.metadata
> GROUP BY "data_object.unique_id"
> );
compare the plan for that query (EXPLAIN) vs this one:
/* delete the records from m1 if there is another record with a lower
   idx for the same "data_object.unique_id" */
DELETE FROM portal.metadata m1
WHERE EXISTS
(
    SELECT 1 FROM portal.metadata m2
    WHERE m1."data_object.unique_id" = m2."data_object.unique_id"
    AND m2.idx < m1.idx
);
also, if you don't already have one, consider making an index on at
least "data_object.unique_id", or possibly ("data_object.unique_id", idx).
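For example, a sketch of that composite index (the index name is arbitrary):

CREATE INDEX metadata_unique_id_idx
    ON portal.metadata ("data_object.unique_id", idx);

-- refresh planner statistics afterwards
ANALYZE portal.metadata;

Building it once over 20M rows is much cheaper than letting the delete
hunt for duplicates without one.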
back up your database before running this query :-).
merlin
One point I might not have made clear. The reason I want to remove
duplicates is that the column "data_object.unique_id" became non-unique
(someone added duplicate rows). So I added the bigserial (idx) to uniquely
identify the rows, and I was using the SELECT MIN(idx) and GROUP BY to pick
just one of the rows that became duplicated.

I am going to try out some of your excellent suggestions. I will report
back on how they are working.

One idea that was given to me was the following (what do you think, Merlin?)

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;

Something of this ilk should be faster because it only needs to do a sort
on "data_object.unique_id" and then an insert. After you have verified the
results you can do:

BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;
COMMIT;
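One caveat with the CREATE TABLE ... AS approach: it copies only the rows,
not the primary key, the idx default, or any other indexes, so those have
to be recreated on the new table. A sketch based on the DDL earlier in the
thread, to run after the rename:

-- restore the primary key on the new table
ALTER TABLE portal.metadata
    ADD CONSTRAINT MetaDataKey PRIMARY KEY (idx);

-- restore the serial default on idx
ALTER TABLE portal.metadata
    ALTER COLUMN idx SET DEFAULT nextval('portal.metadata_idx_seq'::regclass);

-- re-point the sequence's ownership so dropping the old table
-- does not drop the sequence along with it
ALTER SEQUENCE portal.metadata_idx_seq OWNED BY portal.metadata.idx;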
On 5 March 2012 22:43, DrYSG <ygutfreund@draper.com> wrote:
> [...]
> One idea that was given to me was the following (what do you think, Merlin?)
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
>
> Something of this ilk should be faster because it only needs to do a sort
> on "data_object.unique_id" and then an insert. After you have verified the
> results you can do:
>
> BEGIN;
> ALTER TABLE portal.metadata RENAME TO metadata_old;
> ALTER TABLE portal.new_metadata RENAME TO metadata;
> COMMIT;

This sounds like a good way to go, but if you have foreign keys pointing at
portal.metadata I think you will need to drop and recreate them again after
the rename.

--
Michael Wood <esiotrot@gmail.com>
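For concreteness, a sketch of that drop-and-recreate step; the child table,
column, and constraint names here are hypothetical placeholders, not from
the original schema:

-- hypothetical child table: portal.downloads references portal.metadata(idx)
ALTER TABLE portal.downloads
    DROP CONSTRAINT downloads_metadata_fk;

-- ... build the deduplicated table and do the renames ...

ALTER TABLE portal.downloads
    ADD CONSTRAINT downloads_metadata_fk
    FOREIGN KEY (metadata_idx) REFERENCES portal.metadata (idx);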
On Mon, Mar 5, 2012 at 2:43 PM, DrYSG <ygutfreund@draper.com> wrote:
> [...]
> One idea that was given to me was the following (what do you think, Merlin?)
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;

sure that will work, but as Michael noted it's not always practical to do
that. Also, if a fairly small percentage of the records have to be deleted,
an in-place delete may end up being faster anyways. Modern postgres is
pretty smart at optimizing 'where exists' and you should get a decent plan.

merlin
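Checking the plan first costs nothing; EXPLAIN without ANALYZE only plans
the statement, it does not execute the delete:

EXPLAIN
DELETE FROM portal.metadata m1
WHERE EXISTS
(
    SELECT 1 FROM portal.metadata m2
    WHERE m1."data_object.unique_id" = m2."data_object.unique_id"
    AND m2.idx < m1.idx
);

With the ("data_object.unique_id", idx) index in place you would hope to
see a hash or merge semi-join here rather than a plan that rescans the
table for every row.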
The following query took only 16 hours 2 minutes:

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;

Thank you.