Re: identifying duplicates in table with redundancies
From: Tarlika Elisabeth Schmitz
Subject: Re: identifying duplicates in table with redundancies
Date:
Msg-id: 20100929124107.245b1314@dick.coachhouse
In reply to: Re: identifying duplicates in table with redundancies (Andreas Schmitz <mailinglist@longimanus.net>)
List: pgsql-sql
On Wed, 29 Sep 2010 10:40:03 +0200
Andreas Schmitz <mailinglist@longimanus.net> wrote:

> On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
>> On Tue, 28 Sep 2010 11:34:31 +0100
>> "Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote:
>>
>>> ----- Original Message -----
>>> From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
>>> To: <pgsql-sql@postgresql.org>
>>> Sent: Monday, September 27, 2010 5:54 PM
>>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>>
>>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>>> Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote:
>>>>
>>>>>>> SELECT DISTINCT trainer_id, trainer_name
>>>>>>> FROM (
>>>>>>>     SELECT trainer_name -- the field you want to test for duplicates
>>>>>>>     FROM (
>>>>>>>         SELECT DISTINCT "trainer_id", "trainer_name"
>>>>>>>         FROM student
>>>>>>>     ) x
>>>>>>>     GROUP BY "trainer_name" -- the field you want to test for duplicates
>>>>>>>     HAVING (COUNT(*) > 1)
>>>>>>> ) z
>>>>>>> NATURAL JOIN student y
>>>>
>>>> What indices would you recommend for this operation?
>>>
>>> But, on this query in particular I would recommend an index on
>>> trainer_name, [...]
>>
>> Strangely, these indices did not do anything.
>> Without them, the query took about 8500ms. Same with the index.
>>
>> The table has 250000 records. 11000 have trainer_name = null. There
>> are only 13000 unique trainer_names.
>
> I guess EXPLAIN ANALYZE shows a seq scan. Try avoiding DISTINCT;
> use GROUP BY instead.
>
> regards, Andreas

Hello Andreas,

I reduced the problem to the innermost query:

1) SELECT DISTINCT trainer_id, trainer_name FROM student

This results in a sequential table scan. Execution time: 7500ms.

2) I created an index on (trainer_id, trainer_name). This gave me an
index scan instead, which still cost 7000ms.

3) Next, I changed from DISTINCT to GROUP BY:

SELECT trainer_id, trainer_name FROM student
GROUP BY trainer_id, trainer_name

This resulted in an index scan @ 6750ms.

4) I filtered out NULL trainer_ids:

WHERE trainer_id IS NOT NULL

Amazingly, this resulted in a sequential table scan, which took only
1300ms!

Please explain (pun not intended)! How can this be? Only 11000 of
250000 rows have a null trainer_id.

Thanks for the GROUP BY tip!

--
Best Regards,
Tarlika Elisabeth Schmitz
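[Editor's note: for anyone wanting to reproduce the comparison in steps 2-4, here is a minimal sketch, assuming only the student table with trainer_id and trainer_name columns as described in the thread; the index name is illustrative and not taken from the original messages.]

-- Step 2: composite index on both columns (index name is illustrative)
CREATE INDEX student_trainer_idx ON student (trainer_id, trainer_name);

-- Step 3: GROUP BY instead of DISTINCT; EXPLAIN ANALYZE shows the plan chosen
EXPLAIN ANALYZE
SELECT trainer_id, trainer_name
FROM student
GROUP BY trainer_id, trainer_name;

-- Step 4: the same query with NULL trainer_ids filtered out
EXPLAIN ANALYZE
SELECT trainer_id, trainer_name
FROM student
WHERE trainer_id IS NOT NULL
GROUP BY trainer_id, trainer_name;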