Thread: unique rows
I want to make a table that has unique rows - no problem:

  Create Table x (smarts Varchar Unique);

I have input data that has many duplicates, and until now I was creating a temp table allowing duplicates and then:

  Insert into x (smarts) select distinct smarts from tmpx;

This now fails when I have hundreds of millions of rows, running out of temp space (while trying to sort to accomplish the distinct?).

So, is there a way (a constraint, a check?) that will simply REJECT a duplicate when I insert the raw data directly into x, rather than FAILING with an error as it does with the unique constraint above?

Thanks,
TJ O'Donnell
Hi,

1. You can try to create a BEFORE INSERT trigger that checks whether the new row already exists in the table and, if it does, cancels the insert.

2. Or create a UNIQUE INDEX on all the relevant columns, which will throw an error when you try to insert a duplicate row. Then, in the application software, just catch the error and move on to the next insert.

I hope I helped.

Regards,
Kaloyan Iliev

TJ O'Donnell wrote:
> I want to make a table that has unique rows - no problem.
> Create Table x (smarts Varchar Unique);
>
> I have input data that has many duplicates and until now
> I was creating a temp table allowing duplicates, and then
> Insert into x (smarts) select distinct smarts from tmpx;
> This is now failing when I have hundreds of millions of
> rows, running out of temp space (while trying to sort to
> accomplish the distinct?)
>
> So, is there a way (a constraint, a check?) that will simply
> REJECT a duplicate when I insert the raw data directly into x
> rather than FAILING on an error as it does
> with the unique constraint above?
>
> Thanks,
> TJ O'Donnell
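Kaloyan's second suggestion, as a minimal sketch using the table and column from the original post (the index name is made up for illustration):

```sql
-- A unique index enforces uniqueness just like the column constraint:
-- any duplicate insert raises a unique_violation error.
CREATE TABLE x (smarts varchar);
CREATE UNIQUE INDEX x_smarts_idx ON x (smarts);

INSERT INTO x (smarts) VALUES ('C1=CC=CC=C1');  -- succeeds
INSERT INTO x (smarts) VALUES ('C1=CC=CC=C1');  -- raises unique_violation
```

The application would catch the unique_violation error and continue with the next row. Note that a failed insert aborts the current transaction, so to keep going within one transaction each insert would need its own savepoint (the subtransaction approach Markus mentions below).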
Hi, TJ,

TJ O'Donnell wrote:
> So, is there a way (a constraint, a check?) that will simply
> REJECT a duplicate when I insert the raw data directly into x
> rather than FAILING on an error as it does
> with the unique constraint above?

Failing with an error is exactly the way PostgreSQL (and the SQL standard) REJECTS duplicates. :-)

You seem to have silently dropping the duplicates in mind. That could be achieved with a BEFORE INSERT trigger, or with a rule on a view, as both can silently drop the inserted row.

The trigger body could even be in language SQL, along the lines of:

  SELECT CASE WHEN EXISTS (SELECT keycol FROM table WHERE table.keycol = NEW.keycol) THEN NULL ELSE NEW;

Nevertheless, expect insert performance to drop a little due to the trigger overhead.

The alternative approaches (SELECT'ing from the application, using a stored procedure that checks and then inserts the data, and using subtransactions to roll back the failing inserts) all seem worse (uglier and slower) to me, but don't hesitate to ask if you're interested.

HTH,
Markus
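As TJ discovered in practice, a trigger function cannot actually be written in language SQL; a working version of Markus's sketch has to be in PL/pgSQL. Something along these lines, with table and column names taken from the thread and the function/trigger names invented for illustration:

```sql
-- Returning NULL from a BEFORE INSERT trigger silently discards the row,
-- so duplicates are dropped instead of raising an error.
CREATE FUNCTION skip_duplicate_smarts() RETURNS trigger AS $$
BEGIN
    IF EXISTS (SELECT 1 FROM x WHERE smarts = NEW.smarts) THEN
        RETURN NULL;  -- duplicate: drop it silently
    END IF;
    RETURN NEW;       -- new value: let the insert proceed
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER x_skip_dups BEFORE INSERT ON x
    FOR EACH ROW EXECUTE PROCEDURE skip_duplicate_smarts();
```

This fires for COPY as well as INSERT, but the per-row EXISTS lookup is exactly the overhead that made the approach very slow for TJ at hundreds of millions of rows.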
Markus, Kaloyan, and all,

Thanks for the help on this. I got the trigger to work properly, although I discovered that a trigger function in language SQL is not allowed, so I wrote it in PL/pgSQL. It was very, very slow.

So I finally decided to filter the input before attempting to COPY it into the table, using a Perl hash to ensure uniqueness. As a side benefit, I was able to count the frequency of each input string while filtering and include that count in the final table.

TJ O'Donnell

Markus Schaber wrote:
> Failing on an error is exactly the way PostgreSQL (and the SQL standard)
> uses to REJECT duplicates. :-)
>
> You seem to think about silently dropping the duplicates. That could be
> achieved with an BEFORE INSERT trigger, or with a rule on a view, as
> both can silently drop the inserted row.
>
> The trigger body could even be in language SQL, along the lines of:
>
> SELECT CASE WHEN EXISTS (SELECT keycol FROM table WHERE
> table.keycol=NEW.keycol) THEN NULL ELSE NEW;
>
> Nevertheless, expect the insert performance to drop a little, due to the
> trigger overhead.
>
> The alternative approaches (SELECT'ing from the application, using a
> stored procedure that checks and then inserts the data, and using
> subtransactions to roll back the failing inserts) all seem worse (uglier
> and slower) to me, but don't hesitate to ask if you're interested.
>
> HTH,
> Markus
On Wed, Sep 20, 2006 at 22:18:04 -0700, TJ O'Donnell <tjo@acm.org> wrote:
> I want to make a table that has unique rows - no problem.
> Create Table x (smarts Varchar Unique);
>
> I have input data that has many duplicates and until now
> I was creating a temp table allowing duplicates, and then
> Insert into x (smarts) select distinct smarts from tmpx;
> This is now failing when I have hundreds of millions of
> rows, running out of temp space (while trying to sort to
> accomplish the distinct?)

Another option would be to do the select with a GROUP BY instead of DISTINCT. That would allow for a hash aggregate plan (don't forget to ANALYZE the temp table first), which would use less memory.
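A sketch of that suggestion, using the table names from the original post; as a bonus, the same pass can compute the per-string frequencies that TJ ended up wanting (assuming a count column were added to x):

```sql
-- Give the planner fresh row estimates so it can pick HashAggregate
-- over the sort-based plan that was exhausting temp space.
ANALYZE tmpx;

-- GROUP BY produces the same distinct set as DISTINCT:
INSERT INTO x (smarts)
SELECT smarts FROM tmpx GROUP BY smarts;

-- Variant: the same scan can also count each string's frequency,
-- replacing the Perl-hash filtering step entirely:
--   SELECT smarts, count(*) FROM tmpx GROUP BY smarts;
```

Whether the planner actually chooses a hash aggregate depends on its estimate of the number of distinct values fitting in work memory; with hundreds of millions of rows it may still fall back to a sort.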