Re: De-duplicating rows

Поиск
Список
Период
Сортировка
От Scott Marlowe
Тема Re: De-duplicating rows
Дата
Msg-id dcc563d10907170832r1f41fdfcl2377f72bcdcb3dd8@mail.gmail.com
обсуждение исходный текст
Ответ на De-duplicating rows  (Christophe <xof@thebuild.com>)
Список pgsql-sql
On Thu, Jul 16, 2009 at 9:07 PM, Christophe<xof@thebuild.com> wrote:
> The Subject: is somewhat imprecise, but here's what I'm trying to do.  For
> some reason, my brain is locking up over it.
>
> I'm moving a 7.2 (yes) database to 8.4.  In the table in question, the
> structure is along the lines of:
>
>        serial_number   SERIAL, PRIMARY KEY
>        email           TEXT
>        create_date     TIMESTAMP
>        attr1           type
>        attr2           type
>        attr3           type
>        ...
>
> (The point of the "attr" fields is that there are many more columns for each
> row.)
>
> The new structure removes the "serial_number" field, and uses "email" as the
> primary key, but is otherwise unchanged:
>
>        email           TEXT, PRIMARY KEY
>        create_date     TIMESTAMP
>        attr1           type
>        attr2           type
>        attr3           type
>        ...
>
> Now, since this database has been production since 7.2 days, cruft has crept
> in: in particular, there are duplicate email addresses, some with mismatched
> attributes.  The policy decision by the client is that the correct row is
> the one with the earliest timestamp.  (The timestamps are widely
> distributed; it's not the case that there is a single timestamp above which
> all the duplicates live.)  Thus, ideally, I want to select exactly one row
> per "email", picking the row with the earliest timestamp in the case that
> there is more than one row with that email.
>
> Any suggestions on how to write such a SELECT?  Of course, I could do this
> with an application against the db, but a single SELECT would be great if
> possible.

OK, assuming we can keep the serial number during the conversion, we
could use something like this:

select distinct a.serial_number from table a join table b on
(a.email=b.email and a.serial_number>b.serial_number)

Now assuming that the serial numbers and the timestamps are in order
together, that'll give us all the serial numbers for all the matching
email addresses EXCEPT the first one.  If the serial numbers are not
in order with the timestamps, then create a sequence, and update them
in order, then the query will work.  Once you've confirmed by hand
that the first hundred or so serial_numbers you're getting back ARE in
fact all n+1 for the same email address, use the select in a subselect
to delete:

delete from table x where serial_number in (select distinct....)


В списке pgsql-sql по дате отправления:

Предыдущее
От: Frank Bax
Дата:
Сообщение: Re: De-duplicating rows
Следующее
От: "Gianvito Pio"
Дата:
Сообщение: Timestamp with timezone with Default value