Slow Count-Distinct Query

From: Christopher Jackson
Subject: Slow Count-Distinct Query
Date:
Msg-id: CAN81C19-crDaZF3QysefWqL8qNSPZnJu2HY-ys4C-Szpd7LBvw@mail.gmail.com
Responses: Re: Slow Count-Distinct Query  (Shaun Thomas <sthomas@optionshouse.com>)
           Re: Slow Count-Distinct Query  (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: Slow Count-Distinct Query  (bricklen <bricklen@gmail.com>)
List: pgsql-performance
  Hi all,

  tl;dr - How can I speed up my count-distinct query?  

  I apologize in advance if this question has been asked already; I'm finding the mailing list hard to navigate.  I'm trying to speed up a query that counts the distinct emails within a table, using Postgres 9.3.3.  The table is called participants.  Our domain is set up such that duplicate emails are allowed, so long as a particular corresponding column value is unique.

TABLE participants

  id serial NOT NULL (primary key)

  email character varying(255)

  (other columns omitted)

 

I have the following index defined:

index_participants_on_email ON participants USING btree (email COLLATE pg_catalog."default");

The query I'm trying to run is select count(distinct email) from participants.  I've also tried the group by equivalent (both forms are shown below).  On a table of 2 million rows, the query takes about a minute to return, which is way too long.  Looking at the EXPLAIN ANALYZE output, I see that the index is being ignored and a full table scan is performed.
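
For reference, the two forms look roughly like this (the exact group-by rewrite below is just how I've been expressing it):

select count(distinct email) from participants;

-- roughly the "group by equivalent" of the count above
select count(*) from (select email from participants group by email) x;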

So, I tried running the following after dropping the index:

create index email_idx on participants(email) where email=email;
set enable_bitmapscan = false;
set seq_page_cost = 0.1;
set random_page_cost = 0.2;
create index email_idx_2 on participants(email);
cluster participants using email_idx_2;

With these settings in place, if I run select count(distinct email) from participants, I get:

"Aggregate  (cost=29586.20..29586.21 rows=1 width=18) (actual time=54243.643..54243.644 rows=1 loops=1)"
"  ->  Seq Scan on participants  (cost=0.00..24586.18 rows=2000008 width=18) (actual time=0.030..550.296 rows=2000008 loops=1)"
"Total runtime: 54243.669 ms"

When I run the following, I get MUCH better results:
select count(1) from (select email from participants where email=email group by email) x;

"Aggregate  (cost=1856.36..1856.37 rows=1 width=0) (actual time=1393.573..1393.573 rows=1 loops=1)"
"  Output: count(1)"
"  ->  Group  (cost=0.43..1731.36 rows=10000 width=18) (actual time=0.052..1205.977 rows=2000008 loops=1)"
"        Output: participants.email"
"        ->  Index Only Scan using email_idx on public.participants  (cost=0.43..1706.36 rows=10000 width=18) (actual time=0.050..625.248 rows=2000008 loops=1)"
"              Output: participants.email"
"              Heap Fetches: 2000008"
"Total runtime: 1393.599 ms"

This query has a weird where clause (email=email) because I'm trying to force the planner's hand into using the partial index.
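
One thing I notice in that plan is the Heap Fetches: 2000008 line.  My (possibly wrong) understanding is that an index-only scan still has to visit the heap for pages that aren't marked all-visible, so something like this might cut those fetches down:

-- assumption on my part: vacuuming refreshes the visibility map,
-- which should let the index-only scan skip most heap fetches
vacuum analyze participants;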

I'm concerned about setting the enable_bitmapscan, seq_page_cost, and random_page_cost values because I'm not yet sure what the consequences are.  Can anyone enlighten me on the recommended way to speed up this query?
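
In the meantime, my understanding is that a plain SET only affects the current session, so to keep these experiments contained I've been scoping them with SET LOCAL inside a transaction (I believe they reset automatically at commit), e.g.:

begin;
-- set local lasts only until the end of this transaction
set local enable_bitmapscan = false;
set local seq_page_cost = 0.1;
set local random_page_cost = 0.2;
select count(1) from (select email from participants where email=email group by email) x;
commit;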

 Thanks


