Обсуждение: Upsert Functionality using CTEs

Поиск
Список
Период
Сортировка

Upsert Functionality using CTEs

От
Tim Uckun
Дата:
I am using a query like this to try and normalize a table.

WITH nd as (select * from sales order by id limit 100),

people_update as (update people p set first_name = nd.first_name from
nd where p.email = nd.email returning nd.id),

insert into  people (first_name, email, created_at, updated_at)
                      select first_name, email , now(), now()
                      from nd
                      left join people_update using(id) where
people_update.id is null),

This works pretty good except for when the top 100 records have
duplicated email address (two sales for the same email address).

I am wondering what the best strategy is for dealing with this
scenario.  Doing the records one at a time would work but obviously it
would be much slower.  There are no other columns I can rely on to
make the record more unique either.

Re: Upsert Functionality using CTEs

От
Sergey Konoplev
Дата:
On Sun, Feb 10, 2013 at 5:23 PM, Tim Uckun <timuckun@gmail.com> wrote:
> This works pretty good except for when the top 100 records have
> duplicated email address (two sales for the same email address).

How is it assumed to work when the migrating email already exists in people?

>
> I am wondering what the best strategy is for dealing with this
> scenario.  Doing the records one at a time would work but obviously it
> would be much slower.  There are no other columns I can rely on to
> make the record more unique either.
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general



--
Sergey Konoplev
Database and Software Architect
http://www.linkedin.com/in/grayhemp

Phones:
USA +1 415 867 9984
Russia, Moscow +7 901 903 0499
Russia, Krasnodar +7 988 888 1979

Skype: gray-hemp
Jabber: gray.ru@gmail.com

Re: Upsert Functionality using CTEs

От
Alban Hertroys
Дата:
On Feb 11, 2013, at 2:23, Tim Uckun <timuckun@gmail.com> wrote:

> This works pretty good except for when the top 100 records have
> duplicated email address (two sales for the same email address).
>=20
> I am wondering what the best strategy is for dealing with this
> scenario.  Doing the records one at a time would work but obviously it
> would be much slower.  There are no other columns I can rely on to
> make the record more unique either.

The best strategy is fixing your data-model so that you have a unique =
key. As you found out already, e-mail addresses aren't very suitable as =
unique keys for people. For this particular case I'd suggest adding a =
surrogate key.

Alternatively, you might try using (first_name, email) as your key. =
You'll probably still get some duplicates, but they should be less and =
perhaps few enough for your case.


Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

Re: Upsert Functionality using CTEs

От
Tim Uckun
Дата:
> The best strategy is fixing your data-model so that you have a unique key. As you found out already, e-mail addresses
aren'tvery suitable as unique keys for people. For this particular case I'd suggest adding a surrogate key. 
>
> Alternatively, you might try using (first_name, email) as your key. You'll probably still get some duplicates, but
theyshould be less and perhaps few enough for your case. 
>

Yea I'll have to see if I can do a better match.

Re: Upsert Functionality using CTEs

От
Geoff Winkless
Дата:
On 11 February 2013 07:47, Alban Hertroys <haramrae@gmail.com> wrote:

> On Feb 11, 2013, at 2:23, Tim Uckun <timuckun@gmail.com> wrote:
>
> > This works pretty good except for when the top 100 records have
> > duplicated email address (two sales for the same email address).
> >
> > I am wondering what the best strategy is for dealing with this
> > scenario.  Doing the records one at a time would work but obviously it
> > would be much slower.  There are no other columns I can rely on to
> > make the record more unique either.
>
> The best strategy is fixing your data-model so that you have a unique key.
> As you found out already, e-mail addresses aren't very suitable as unique
> keys for people. For this particular case I'd suggest adding a surrogate
> key.
>
> Alternatively, you might try using (first_name, email) as your key. You'll
> probably still get some duplicates, but they should be less and perhaps few
> enough for your case.
>

I read it that he has multiple sales from the same person? In which case
pretending that the two sales were from different people isn't the correct
result at all.

I may be missing the point of the query, but wasn't it to add an entry for
each email address and (if it already exists in people) to update the
firstname field to match the firstname in the order?

Given that the results will be returned in random order I would just change
the "nd" query to SELECT DISTINCT ON (email) , but like I said I may be
missing the point.

Geoff

Re: Upsert Functionality using CTEs

От
Tim Uckun
Дата:
>
>
> I read it that he has multiple sales from the same person? In which case
> pretending that the two sales were from different people isn't the correct
> result at all.

Actually it turns out that both cases exist.

>
> I may be missing the point of the query, but wasn't it to add an entry for
> each email address and (if it already exists in people) to update the
> firstname field to match the firstname in the order?

Yes. Basically the sales table is denormalized and I need to create or
update normalized tables from it. As you have pointed out the
difficult scenario is when the same person has multiple sales.


> Given that the results will be returned in random order I would just change
> the "nd" query to SELECT DISTINCT ON (email) , but like I said I may be
> missing the point.

I think in this case I am probably going to have go through the
records one at a time.  I run into the same issues with other columns
such as products.