Discussion: How to "unique-ify" HUGE table?
Hi everyone!
I have a very large 2-column table (about 500M records) from which I want to remove duplicate records.
I have tried many approaches, but they all take forever.
The table consists of just two short TEXT columns. It is a temporary table generated from a query:
CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ;
Initially I tried
CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
but after waiting for nearly an hour I aborted the query, and repeated it after getting rid of the DISTINCT clause.
Everything takes forever with this monster! It's uncanny. Even printing it out to a file takes forever, let alone creating an index for it.
Any words of wisdom on how to speed this up would be appreciated.
TIA!
Kynn
On Tue, Dec 23, 2008 at 10:25 AM, Kynn Jones <kynnjo@gmail.com> wrote:
> I have a very large 2-column table (about 500M records) from which I want to
> remove duplicate records.
> [...]
> Any words of wisdom on how to speed this up would be appreciated.

Did you try cranking up work_mem to something that's a large percentage (25 to 50%) of total memory?
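For reference, a minimal sketch of what raising work_mem for just this job can look like (the '1GB' figure is a placeholder, not a recommendation -- size it to your machine's RAM; the FROM clause is elided exactly as in the original post):

-- Session-local override; no server restart needed.
SET work_mem = '1GB';

-- Or scoped to a single transaction:
BEGIN;
SET LOCAL work_mem = '1GB';
CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;  -- original source query
COMMIT;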
On Tue, 23 Dec 2008 12:25:48 -0500 "Kynn Jones" <kynnjo@gmail.com> wrote:
> I have a very large 2-column table (about 500M records) from which I want to
> remove duplicate records.
> [...]
> CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ;
>
> but after waiting for nearly an hour I aborted the query, and repeated it

Do you have an index on x and y? Also, does this work better?

CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... GROUP BY x, y;

What does EXPLAIN ANALYZE have to say?

--
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
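A minimal sketch of that check, run against the already-built temp table (note that EXPLAIN ANALYZE actually executes the statement, so on a 500M-row table consider trying it on a LIMITed sample first):

-- Reports the real plan chosen (hash aggregate vs. sort) plus actual timings.
EXPLAIN ANALYZE
SELECT x, y FROM huge_table GROUP BY x, y;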
You don't say what PG version you are on, but just for kicks you may try using GROUP BY instead of DISTINCT. Yes, the two should perform the same, but with 8.1 (or maybe 8.0) I had seen situations where GROUP BY was faster (admittedly this happened with more complex queries). So, try this:

CREATE TEMP TABLE huge_table AS SELECT x, y FROM foo GROUP BY 1, 2;

Note that you may be tempted to add an index on foo(x,y), but I don't think that helps (or at least I have not been able to hit the index in similar situations).

> From: Kynn Jones
> Subject: [PERFORM] How to "unique-ify" HUGE table?
> [...]
On Tue, Dec 23, 2008 at 11:14 AM, George Pavlov <gpavlov@mynewplace.com> wrote:
> You don't say what PG version you are on, but just for kicks you may try
> using GROUP BY instead of DISTINCT. Yes, the two should perform the
> same, but with 8.1 (or maybe 8.0) I had seen situations where GROUP BY
> was faster (admittedly this happened with more complex queries).

Even in 8.3 it looks like GROUP BY is faster. I tested it on a decent-sized table: GROUP BY used a hash aggregate and ran in ~600 ms, while DISTINCT used a sort and ran in ~1300 ms. That was on 500k rows. On a much larger table, one with about 10M rows, a similar statement runs in 1500 ms with GROUP BY and in 2390 ms with DISTINCT.
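A rough sketch of how to reproduce that comparison on a scratch table (the table name, value range, and row count here are illustrative, not the ones used in the test above):

-- Build a scratch table of short TEXT values with plenty of duplicates.
CREATE TEMP TABLE bench AS
    SELECT (random() * 1000)::int::text AS x,
           (random() * 1000)::int::text AS y
    FROM generate_series(1, 500000);

-- Compare the two formulations; timings will vary by hardware and version.
EXPLAIN ANALYZE SELECT DISTINCT x, y FROM bench;
EXPLAIN ANALYZE SELECT x, y FROM bench GROUP BY x, y;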
Scott Marlowe wrote:
> Even in 8.3 it looks like GROUP BY is faster. I tested it on a
> decent-sized table: GROUP BY used a hash aggregate and ran in ~600 ms,
> while DISTINCT used a sort and ran in ~1300 ms.

Not surprising - this is a known limitation in all released versions of PostgreSQL (GROUP BY can use hashing and sorting - DISTINCT only sorting). 8.4 is going to improve that, though.

Stefan
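To make the difference concrete, these are the abbreviated plan shapes one would typically see on a pre-8.4 server (schematic only -- real EXPLAIN output also includes costs and row estimates):

-- GROUP BY: the planner may choose a hash aggregate, which avoids sorting.
--   HashAggregate
--     ->  Seq Scan on huge_table
--
-- DISTINCT (pre-8.4): only the sort-based plan is available.
--   Unique
--     ->  Sort
--           ->  Seq Scan on huge_table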
Thank you all for the very helpful advice. Upping work_mem made it possible for me to generate the table within this century without bringing the server to a near standstill. I have not yet experimented with GROUP BY, but I'll try that next.
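For anyone landing on this thread later, a consolidated sketch of the approach that emerged (the FROM clause is elided as in the original posts, and the work_mem value is a placeholder to be sized to available RAM):

-- Give the sort/hash step room to work in memory (session-local setting).
SET work_mem = '1GB';

-- Dedupe with GROUP BY so the planner can choose a hash aggregate.
CREATE TEMP TABLE huge_table AS
    SELECT x, y
    FROM ...            -- the original source query
    GROUP BY x, y;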
Cheers,
Kynn