Discussion: unique rows


unique rows

From
TJ O'Donnell
Date: Wed, Sep 20, 2006 22:18:04 -0700
I want to make a table that has unique rows - no problem.
Create Table x (smarts Varchar Unique);

I have input data that has many duplicates, and until now
I have been creating a temp table allowing duplicates and then running
Insert into x (smarts) select distinct smarts from tmpx;
This is now failing when I have hundreds of millions of
rows: it runs out of temp space (while trying to sort to
accomplish the DISTINCT?)

So, is there a way (a constraint, a check?) that will simply
REJECT a duplicate when I insert the raw data directly into x
rather than FAILING on an error as it does
with the unique constraint above?

Thanks,
TJ O'Donnell


Re: unique rows

From
Kaloyan Iliev
Date:
Hi,

1. You can try to create a BEFORE INSERT trigger that checks whether the new
row already exists in the table and, if it does, cancels the insert
(see the sketch after this list).
2. Or create a UNIQUE INDEX on all the columns, which will throw an error if
you try to insert duplicate rows into the table.
Then in the application software just catch the error and move on with
the next insert.
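
A minimal sketch of option 1, assuming the single-column table x(smarts)
from the original post (the function and trigger names below are made up
for illustration):

CREATE OR REPLACE FUNCTION skip_duplicate_smarts() RETURNS trigger AS $$
BEGIN
    -- Returning NULL from a BEFORE trigger silently cancels this insert.
    IF EXISTS (SELECT 1 FROM x WHERE smarts = NEW.smarts) THEN
        RETURN NULL;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER x_skip_duplicates
    BEFORE INSERT ON x
    FOR EACH ROW EXECUTE PROCEDURE skip_duplicate_smarts();

With hundreds of millions of rows the per-row EXISTS lookup is only
reasonable if smarts is indexed, which the UNIQUE constraint already
provides.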

I hope this helps.

Regards,
 Kaloyan Iliev


TJ O'Donnell wrote:

> I want to make a table that has unique rows - no problem.
> Create Table x (smarts Varchar Unique);
>
> I have input data that has many duplicates and until now
> I was creating a temp table allowing duplicates, and then
> Insert into x (smarts) select distinct smarts from tmpx;
> This is now failing when I have hundreds of millions of
> rows, running out of temp space (while trying to sort to
> accomplish the distinct?)
>
> So, is there a way (a constraint, a check?) that will simply
> REJECT a duplicate when I insert the raw data directly into x
> rather than FAILING on an error as it does
> with the unique constraint above?
>
> Thanks,
> TJ O'Donnell
>



Re: unique rows

From
Markus Schaber
Date:
Hi, TJ,

TJ O'Donnell wrote:

> So, is there a way (a constraint, a check?) that will simply
> REJECT a duplicate when I insert the raw data directly into x
> rather than FAILING on an error as it does
> with the unique constraint above?

Failing with an error is exactly how PostgreSQL (and the SQL standard)
REJECTs duplicates. :-)

You seem to be thinking of silently dropping the duplicates. That could be
achieved with a BEFORE INSERT trigger, or with a rule on a view, as
both can silently drop the inserted row.

The trigger body could even be written in SQL, along the lines of:

SELECT CASE WHEN EXISTS (SELECT keycol FROM table WHERE
table.keycol = NEW.keycol) THEN NULL ELSE NEW END;

Nevertheless, expect the insert performance to drop a little, due to the
trigger overhead.

The alternative approaches (SELECT'ing from the application, using a
stored procedure that checks and then inserts the data, and using
subtransactions to roll back the failing inserts) all seem worse (uglier
and slower) to me, but don't hesitate to ask if you're interested.
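
For illustration, a minimal sketch of the subtransaction variant, assuming
the UNIQUE constraint on x(smarts) from the original post is in place (the
function name is made up); the EXCEPTION clause runs the INSERT in a
subtransaction, so a duplicate key only rolls back that one statement:

CREATE OR REPLACE FUNCTION insert_smarts(p_smarts varchar) RETURNS void AS $$
BEGIN
    INSERT INTO x (smarts) VALUES (p_smarts);
EXCEPTION
    WHEN unique_violation THEN
        NULL;  -- duplicate: silently ignore it
END;
$$ LANGUAGE plpgsql;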

HTH,
Markus




Re: unique rows

From
TJ O'Donnell
Date:
Markus, Kaloyan, and all

Thanks for the help on this.  I got the trigger to work properly,
although I discovered that a trigger function written in plain SQL is not
allowed, so I wrote it in PL/pgSQL.  It was very, very slow.  So I finally
decided to filter the input before attempting to COPY it into the table,
using a Perl hash to ensure uniqueness.  As a side benefit, I was able
to count the frequency of each input string while filtering and include
that count in the final table.

TJ O'Donnell

Markus Schaber wrote:
> Hi, TJ,
> 
> TJ O'Donnell wrote:
> 
>> So, is there a way (a constraint, a check?) that will simply
>> REJECT a duplicate when I insert the raw data directly into x
>> rather than FAILING on an error as it does
>> with the unique constraint above?
> 
> Failing on an error is exactly the way PostgreSQL (and the SQL standard)
> uses to REJECT duplicates. :-)
> 
> You seem to be thinking of silently dropping the duplicates. That could be
> achieved with a BEFORE INSERT trigger, or with a rule on a view, as
> both can silently drop the inserted row.
> 
> The trigger body could even be in language SQL, along the lines of:
> 
> SELECT CASE WHEN EXISTS (SELECT keycol FROM table WHERE
> table.keycol = NEW.keycol) THEN NULL ELSE NEW END;
> 
> Nevertheless, expect the insert performance to drop a little, due to the
> trigger overhead.
> 
> The alternative approaches (SELECT'ing from the application, using a
> stored procedure that checks and then inserts the data, and using
> subtransactions to roll back the failing inserts) all seem worse (uglier
> and slower) to me, but don't hesitate to ask if you're interested.
> 
> HTH,
> Markus
> 


Re: unique rows

From
Bruno Wolff III
Date:
On Wed, Sep 20, 2006 at 22:18:04 -0700, TJ O'Donnell <tjo@acm.org> wrote:
> I want to make a table that has unique rows - no problem.
> Create Table x (smarts Varchar Unique);
> 
> I have input data that has many duplicates and until now
> I was creating a temp table allowing duplicates, and then
> Insert into x (smarts) select distinct smarts from tmpx;
> This is now failing when I have hundreds of millions of
> rows, running out of temp space (while trying to sort to
> accomplish the distinct?)

Another option would be to do a SELECT with a GROUP BY instead of DISTINCT.
That would allow for a hash aggregate plan (don't forget to ANALYZE the
temp table first), which would use less memory.
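
A sketch of that suggestion, using the x and tmpx tables from the original
post. As a bonus, the same GROUP BY can also produce the per-string
frequency count TJ mentioned wanting (the freq column in the commented
variant is hypothetical):

-- Update statistics so the planner can choose a hash aggregate.
ANALYZE tmpx;

-- Deduplicate with GROUP BY instead of DISTINCT.
INSERT INTO x (smarts)
SELECT smarts
FROM tmpx
GROUP BY smarts;

-- Variant that also records how often each string occurred,
-- assuming x has an extra integer column such as "freq":
-- INSERT INTO x (smarts, freq)
-- SELECT smarts, count(*) FROM tmpx GROUP BY smarts;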