Fwd: Slow Count-Distinct Query

Поиск
Список
Период
Сортировка
От Varadharajan Mukundan
Тема Fwd: Slow Count-Distinct Query
Дата
Msg-id CACKkDGE-6LJF-vJdmaPiGghLjwcQXDF1cAsnZtQhF+wA9KVtFQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: Fwd: Slow Count-Distinct Query  (Jeff Janes <jeff.janes@gmail.com>)
Список pgsql-performance
Sorry that i just joined the list and have to break the thread to reply to Tom Lane's response on this @ http://www.postgresql.org/message-id/13741.1396275339@sss.pgh.pa.us


Note that the indexscan is actually *slower* than the seqscan so far as
the table access is concerned; if the table were big enough to not fit
in RAM, this would get very much worse.  So I'm not impressed with trying
to force the optimizer's hand as you've done here --- it might be a nice
plan now, but it's brittle.  See if a bigger work_mem improves matters
enough with the regular plan.

I agree to the point that hand tuning optimiser is brittle and something that should not be done. But the reason to that was to force the index-only scan (Not the index scan). I feel Index-only scan would speed up given postgres is row oriented and we are running count-distinct on a column in a table with lot of columns (Say 6-7 in number). I think that is what have contributed to the gain in performance. 

I did a similar test with around 2 million tuples with work_mem = 250 MB and got the query to respond with 2x speed up. But the speed-up got with index-only scan was huge and response was in sub-seconds whereas with work_mem the response was couple of seconds.


I doubt you need the "where email=email" hack, in any case.  That isn't
forcing the optimizer's decision in any meaningful fashion.

This is the where clause which forced the index-only scan on partial index. AFAIK, whenever the the tuples estimated to be processed by query is going to be high, Postgres does a seq scan. This was happening even in the following query even though an index was present in the email column and index-only could have had much faster processing. I think the planner is taking a hit on estimating the cost of the query with index-only scan. So the little trick we had put in was to create a partial index on email field of the participants table and use the same condition of the partial index in the query to trick the optimiser to use index-only scans.

Query : select count(distinct email) from participants;

Please revert me back if i'm missing something.


--
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- www.thinkasgeek.wordpress.com

В списке pgsql-performance по дате отправления:

Предыдущее
От: Ilya Kosmodemiansky
Дата:
Сообщение: Re: PGSQL, checkpoints, and file system syncs
Следующее
От: Nicolas Paris
Дата:
Сообщение: PGSQL 9.3 - Materialized View - multithreading