Обсуждение: How to "unique-ify" HUGE table?

Поиск
Список
Период
Сортировка

How to "unique-ify" HUGE table?

От
"Kynn Jones"
Дата:
Hi everyone!

I have a very large 2-column table (about 500M records) from which I want to remove duplicate records.

I have tried many approaches, but they all take forever.

The table's definition consists of two short TEXT columns.  It is a temporary table generated from a query:

CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;

Initially I tried

CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;

but after waiting for nearly an hour I aborted the query, and repeated it after getting rid of the DISTINCT clause.

Everything takes forever with this monster!  It's uncanny.  Even printing it out to a file takes forever, let alone creating an index for it.

Any words of wisdom on how to speed this up would be appreciated.

TIA!

Kynn



Re: How to "unique-ify" HUGE table?

От
"Scott Marlowe"
Дата:
On Tue, Dec 23, 2008 at 10:25 AM, Kynn Jones <kynnjo@gmail.com> wrote:
> Hi everyone!
> I have a very large 2-column table (about 500M records) from which I want to
> remove duplicate records.
> I have tried many approaches, but they all take forever.
> The table's definition consists of two short TEXT columns.  It is a
> temporary table generated from a query:
>
> CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;
> Initially I tried
> CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
> but after waiting for nearly an hour I aborted the query, and repeated it
> after getting rid of the DISTINCT clause.
> Everything takes forever with this monster!  It's uncanny.  Even printing it
> out to a file takes forever, let alone creating an index for it.
> Any words of wisdom on how to speed this up would be appreciated.

Did you try cranking up work_mem to something that's a large
percentage (25 to 50%) of total memory?

Re: How to "unique-ify" HUGE table?

От
"D'Arcy J.M. Cain"
Дата:
On Tue, 23 Dec 2008 12:25:48 -0500
"Kynn Jones" <kynnjo@gmail.com> wrote:
> Hi everyone!
> I have a very large 2-column table (about 500M records) from which I want to
> remove duplicate records.
>
> I have tried many approaches, but they all take forever.
>
> The table's definition consists of two short TEXT columns.  It is a
> temporary table generated from a query:
>
> CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;
>
> Initially I tried
>
>  CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
>
> but after waiting for nearly an hour I aborted the query, and repeated it

Do you have an index on x and y?  Also, does this work better?

CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... GROUP BY x, y;

What does ANALYZE EXPLAIN have to say?

--
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.

Re: How to "unique-ify" HUGE table?

От
"George Pavlov"
Дата:
You don't say what PG version you are on, but just for kicks you may try
using GROUP BY instead of DISTINCT. Yes, the two should perform the
same, but with 8.1 (or maybe 8.0) I had seen situations where GROUP BY
was faster (admittedly this happened with more complex queries). So, try
this:

  CREATE TEMP TABLE huge_table AS SELECT x, y FROM foo GROUP BY 1, 2;

Note that you may be tempted to add an index on foo(x,y), but I don't
think that helps (or at least I have not been able to hit the index in
similar situations).


> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Kynn Jones
> Sent: Tuesday, December 23, 2008 9:26 AM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] How to "unique-ify" HUGE table?
>
> Hi everyone!
>
> I have a very large 2-column table (about 500M records) from which I
> want to remove duplicate records.
>
> I have tried many approaches, but they all take forever.
>
> The table's definition consists of two short TEXT columns.  It is a
> temporary table generated from a query:
>
>
> CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;
>
> Initially I tried
>
> CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
>
> but after waiting for nearly an hour I aborted the query, and repeated
> it after getting rid of the DISTINCT clause.
>
> Everything takes forever with this monster!  It's uncanny.  Even
> printing it out to a file takes forever, let alone creating an index
> for it.
>
> Any words of wisdom on how to speed this up would be appreciated.
>
> TIA!
>
> Kynn
>
>


Re: How to "unique-ify" HUGE table?

От
"Scott Marlowe"
Дата:
On Tue, Dec 23, 2008 at 11:14 AM, George Pavlov <gpavlov@mynewplace.com> wrote:
> You don't say what PG version you are on, but just for kicks you may try
> using GROUP BY instead of DISTINCT. Yes, the two should perform the
> same, but with 8.1 (or maybe 8.0) I had seen situations where GROUP BY
> was faster (admittedly this happened with more complex queries). So, try
> this:

Even in 8.3 it looks like group by is faster.  Tested it on a decent
sized table and group by used a hash agg and ran in ~600 ms, while
distinct used a sort and ran in 1300 ms.  That was on 500k rows.  On a
much larger table, one with about 10M rows, a similar statement runs
in 1500 ms with group by and in 2390 ms when run with distinct.

Re: How to "unique-ify" HUGE table?

От
Stefan Kaltenbrunner
Дата:
Scott Marlowe wrote:
> On Tue, Dec 23, 2008 at 11:14 AM, George Pavlov <gpavlov@mynewplace.com> wrote:
>> You don't say what PG version you are on, but just for kicks you may try
>> using GROUP BY instead of DISTINCT. Yes, the two should perform the
>> same, but with 8.1 (or maybe 8.0) I had seen situations where GROUP BY
>> was faster (admittedly this happened with more complex queries). So, try
>> this:
>
> Even in 8.3 it looks like group by is faster.  Tested it on a decent
> sized table and group by used a hash agg and ran in ~600 ms, while
> distinct used a sort and ran in 1300 ms.  That was on 500k rows.  On a
> much larger table, one with about 10M rows, a similar statement runs
> in 1500 ms with group by and in 2390 ms when run with distinct.

Not surprising - this is a known limitation in all released versions of
postgresql (GROUP BY can use hashing and sorting - DISTINCT only
sorting). 8.4 is going to improve that though.


Stefan

Re: How to "unique-ify" HUGE table?

От
"Kynn Jones"
Дата:
Thank you all for the very helpful advice.  Upping work_mem made it possible for me to generate the table within this century without bringing the server to a near standstill.  I have not yet experimented with GROUP BY, but I'll do this next.

Cheers,

Kynn