Re: Finding uniques across a big join

From: Scott Marlowe
Subject: Re: Finding uniques across a big join
Date:
Msg-id: 1133390546.16010.30.camel@state.g2switchworks.com
In response to: Re: Finding uniques across a big join  ("John D. Burger" <john@mitre.org>)
Responses: Re: Finding uniques across a big join  ("John D. Burger" <john@mitre.org>)
List: pgsql-general
On Wed, 2005-11-30 at 16:27, John D. Burger wrote:
> Scott Marlowe wrote:
>
> > select
> >     v1.pkey1,
> >     v1.field2,
> >     v1.field3,
> >     v1.field4,
> >     v2.pkey1,
> >     v2.field2,
> >     v2.field3,
> >     v2.field4
> > from
> >     view v1
> > join
> >     view v2
> > on (
> >     v1.field2=v2.field2 and
> >     v1.field3=v2.field3 and
> >     v1.field4=v2.field4 and
> >     v1.pkey1<>v2.pkey1
> > )
> >
> > How does that work?
>
> Won't this be a massive cross product of all pkey pairs that have the
> same field values?
>

Yes, assuming there are a lot of them.  OTOH, if there are only a few
duplicates you're looking for...

How many are you expecting, percentage wise, to get back?
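
A quick way to get a feel for that (just a sketch, reusing the same
placeholder view/field names as the queries below) is to count how many
field combinations occur more than once:

select count(*) as dup_combos
from (
    select field2, field3, field4
    from view
    group by field2, field3, field4
    having count(*) > 1
) as d;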

> Here's what I'm currently using, in terms of your very helpful view:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
>    from view as v1
>    join
>    (select v2.field2, v2.field3, v2.field4
>     from view as v2
>     group by v2.field2, v2.field3, v2.field4
>     having count(*) = 1) as u
>    using (field2, field3, field4);
>
> This is the one that takes eight hours. :(  Another way to express what
> I want is this:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
>    from view as v1
>    where not exists
>      (select true from view as v2
>       where v1.field2 = v2.field2
>         and v1.field3 = v2.field3
>         and v1.field4 = v2.field4
>         and v1.pkey1 <> v2.pkey1);
>
> That looks like a horrible nested loop, but I suppose I should try it
> to make sure it is indeed slower than the previous query.

If you can allocate enough memory for the set to fit, you might be able
to get a hash agg method, which is much faster than most other methods
for this kind of thing, since it requires no sort.
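
Something like this (just a sketch: the setting is work_mem on 8.x,
sort_mem on older releases, and the value below is only an example, in
kB) would show whether the planner picks a hash aggregate for the GROUP
BY form instead of sorting:

set work_mem = 500000;   -- per-operation memory in kB; size it to what the box can spare

explain
select field2, field3, field4
from view
group by field2, field3, field4
having count(*) = 1;

Look for a HashAggregate node instead of GroupAggregate/Sort in the plan.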

As a side point, I'm currently mushing 888,000,000 six-character codes
up against each other to check for duplicates.  I have 6 machines doing
this, comparing 1 million codes against 1 million codes every 0.5
seconds in aggregate.  That gets me down to about a week.  So, 8 hours
seems quite fast.  :)
