Discussion: Upsert Functionality using CTEs
I am using a query like this to try and normalize a table:

WITH nd AS (
    SELECT * FROM sales ORDER BY id LIMIT 100
), people_update AS (
    UPDATE people p
       SET first_name = nd.first_name
      FROM nd
     WHERE p.email = nd.email
    RETURNING nd.id
)
INSERT INTO people (first_name, email, created_at, updated_at)
SELECT first_name, email, now(), now()
  FROM nd
  LEFT JOIN people_update USING (id)
 WHERE people_update.id IS NULL;

This works pretty well except when the top 100 records contain duplicate email addresses (two sales for the same email address). I am wondering what the best strategy is for dealing with this scenario. Doing the records one at a time would work, but obviously it would be much slower. There are no other columns I can rely on to make the records unique either.
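For readers reconstructing the setup, a minimal sketch of the tables the query above seems to assume (the id primary keys and the amount column are guesses, not taken from the post):

CREATE TABLE sales (
    id         serial PRIMARY KEY,
    first_name text,
    email      text,
    amount     numeric  -- stand-in for the rest of the denormalized sale data
);

CREATE TABLE people (
    id         serial PRIMARY KEY,
    first_name text,
    email      text,
    created_at timestamptz,
    updated_at timestamptz
);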
On Sun, Feb 10, 2013 at 5:23 PM, Tim Uckun <timuckun@gmail.com> wrote:
> This works pretty well except when the top 100 records contain
> duplicate email addresses (two sales for the same email address).

How is it supposed to work when the incoming email address already exists in people?

> I am wondering what the best strategy is for dealing with this
> scenario. Doing the records one at a time would work, but obviously it
> would be much slower. There are no other columns I can rely on to
> make the records unique either.

--
Sergey Konoplev
Database and Software Architect
http://www.linkedin.com/in/grayhemp
On Feb 11, 2013, at 2:23, Tim Uckun <timuckun@gmail.com> wrote:
> This works pretty well except when the top 100 records contain
> duplicate email addresses (two sales for the same email address).
>
> I am wondering what the best strategy is for dealing with this
> scenario. Doing the records one at a time would work, but obviously it
> would be much slower. There are no other columns I can rely on to
> make the records unique either.

The best strategy is fixing your data model so that you have a unique key. As you have found out already, e-mail addresses aren't very suitable as unique keys for people. For this particular case I'd suggest adding a surrogate key.

Alternatively, you might try using (first_name, email) as your key. You'll probably still get some duplicates, but there should be fewer of them, perhaps few enough for your case.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
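To make that alternative concrete, here is a rough, untested sketch of the original query adapted to match on (first_name, email) instead of email alone; since first_name is now part of the match key, the UPDATE only bumps updated_at (an assumption, not something specified above):

WITH nd AS (
    SELECT * FROM sales ORDER BY id LIMIT 100
), people_update AS (
    UPDATE people p
       SET updated_at = now()          -- first_name is part of the match key now
      FROM nd
     WHERE p.email = nd.email
       AND p.first_name = nd.first_name
    RETURNING nd.id
)
INSERT INTO people (first_name, email, created_at, updated_at)
SELECT first_name, email, now(), now()
  FROM nd
  LEFT JOIN people_update USING (id)
 WHERE people_update.id IS NULL;

This still breaks if a single batch contains two sales with the same first_name and email, so it narrows the problem rather than eliminating it.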
> The best strategy is fixing your data model so that you have a unique key. As you have found out already, e-mail addresses aren't very suitable as unique keys for people. For this particular case I'd suggest adding a surrogate key.
>
> Alternatively, you might try using (first_name, email) as your key. You'll probably still get some duplicates, but there should be fewer of them, perhaps few enough for your case.

Yea, I'll have to see if I can do a better match.
On 11 February 2013 07:47, Alban Hertroys <haramrae@gmail.com> wrote:
> The best strategy is fixing your data model so that you have a unique key.
> As you have found out already, e-mail addresses aren't very suitable as
> unique keys for people. For this particular case I'd suggest adding a
> surrogate key.
>
> Alternatively, you might try using (first_name, email) as your key. You'll
> probably still get some duplicates, but there should be fewer of them,
> perhaps few enough for your case.

I read it that he has multiple sales from the same person? In which case pretending that the two sales were from different people isn't the correct result at all.

I may be missing the point of the query, but wasn't it to add an entry for each email address and (if it already exists in people) to update the first_name field to match the first_name in the order?

Given that the results will be returned in random order, I would just change the "nd" query to SELECT DISTINCT ON (email), but like I said I may be missing the point.

Geoff
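A sketch of that change, assuming the rest of the query stays as posted. Note that DISTINCT ON requires email to lead the ORDER BY, so the batch becomes "100 distinct email addresses, earliest sale per address" rather than "the first 100 sales by id":

SELECT DISTINCT ON (email) *
  FROM sales
 ORDER BY email, id   -- keep the earliest sale for each address
 LIMIT 100;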
> I read it that he has multiple sales from the same person? In which case
> pretending that the two sales were from different people isn't the correct
> result at all.

Actually it turns out that both cases exist.

> I may be missing the point of the query, but wasn't it to add an entry for
> each email address and (if it already exists in people) to update the
> first_name field to match the first_name in the order?

Yes. Basically the sales table is denormalized and I need to create or update normalized tables from it. As you have pointed out, the difficult scenario is when the same person has multiple sales.

> Given that the results will be returned in random order, I would just change
> the "nd" query to SELECT DISTINCT ON (email), but like I said I may be
> missing the point.

I think in this case I am probably going to have to go through the records one at a time. I run into the same issues with other columns such as products.
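A rough sketch of the one-at-a-time fallback as a PL/pgSQL loop (untested, written against the same assumed columns, and with no more concurrency protection than the original query):

DO $$
DECLARE
    s sales%ROWTYPE;
BEGIN
    FOR s IN SELECT * FROM sales ORDER BY id LIMIT 100 LOOP
        -- Try the update first; FOUND tells us whether any row matched.
        UPDATE people
           SET first_name = s.first_name,
               updated_at = now()
         WHERE email = s.email;

        IF NOT FOUND THEN
            INSERT INTO people (first_name, email, created_at, updated_at)
            VALUES (s.first_name, s.email, now(), now());
        END IF;
    END LOOP;
END;
$$;

Because the rows are processed in id order, a second sale for the same address simply updates the row the first one inserted, which is the behaviour the batch version gets wrong.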