Discussion: Slow Count-Distinct Query

Slow Count-Distinct Query

From
Christopher Jackson
Date:
  Hi all,

  tl;dr - How can I speed up my count-distinct query?  

  I apologize in advance if this question has been asked already.  I'm finding the mailing list hard to navigate.  I'm trying to speed up a query that counts the distinct emails within a table, using Postgres 9.3.3.  The table is named participants.  Our domain is set up such that duplicate emails are allowed, so long as a particular corresponding column value is unique.

TABLE participants

  id serial NOT NULL (primary key)

  email character varying(255)

  (other columns omitted)

 

I have the following index defined:

CREATE INDEX index_participants_on_email ON participants USING btree (email COLLATE pg_catalog."default");

The query I'm trying to run is select count(distinct email) from participants.  I've also tried the GROUP BY equivalent.  On a table of 2 million rows, the query takes about a minute to return, which is far too long.  Running EXPLAIN ANALYZE shows that the index is ignored and a full table scan is performed.
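
For reference, the plain GROUP BY form is roughly the following (just a sketch; only the participants table and email column above are assumed):

select count(*) from (
  select email
  from participants
  group by email
) x;  -- same answer as count(distinct email)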

So, I tried running the following after dropping the index:

create index email_idx on participants(email) where email=email;
set enable_bitmapscan = false;
set seq_page_cost = 0.1;
set random_page_cost = 0.2;
create index email_idx_2 on participants(email);
cluster participants using email_idx_2;

With these settings in place, if I run select count(distinct email) from participants I get:

"Aggregate  (cost=29586.20..29586.21 rows=1 width=18) (actual time=54243.643..54243.644 rows=1 loops=1)"
"  ->  Seq Scan on participants  (cost=0.00..24586.18 rows=2000008 width=18) (actual time=0.030..550.296 rows=2000008 loops=1)"
"Total runtime: 54243.669 ms"

When I run the following, I get MUCH better results:
select count(1) from (select email from participants where email=email group by email) x;

"Aggregate  (cost=1856.36..1856.37 rows=1 width=0) (actual time=1393.573..1393.573 rows=1 loops=1)"
"  Output: count(1)"
"  ->  Group  (cost=0.43..1731.36 rows=10000 width=18) (actual time=0.052..1205.977 rows=2000008 loops=1)"
"        Output: participants.email"
"        ->  Index Only Scan using email_idx on public.participants  (cost=0.43..1706.36 rows=10000 width=18) (actual time=0.050..625.248 rows=2000008 loops=1)"
"              Output: participants.email"
"              Heap Fetches: 2000008"
"Total runtime: 1393.599 ms"

This query has a weird where clause (email=email) because I'm trying to force the analyzer's hand to use the index.

I'm concerned about setting the enable_bitmapscan and seq_page_cost values because I'm not yet sure what the consequences are.  Can anyone enlighten me on the recommended way to speed up this query?

 Thanks



Re: Slow Count-Distinct Query

From
Shaun Thomas
Date:
>  tl;dr - How can I speed up my count-distinct query?  

You can't.

Doing a count(distinct x) is very different from a count(1), which can simply scan available indexes.  To build a
distinct, it has to construct an in-memory hash of every valid email, and count the distinct values therein.  This will
pretty much never be fast, especially with 2M rows involved.

I could be wrong about this, and the back-end folks might have a different answer, but I wouldn't hold my breath.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 400 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Slow Count-Distinct Query

From
Tom Lane
Date:
Christopher Jackson <crjackso@gmail.com> writes:
>   tl;dr - How can I speed up my count-distinct query?

EXPLAIN doesn't provide a lot of visibility into what the Aggregate plan
node is doing, but in this case what it's doing is an internal sort/uniq
operation to implement the DISTINCT.  You didn't say what value of
work_mem you're using, but it'd need to be probably 50-100MB to prevent
that sort from spilling to disk (and therefore being slow).

Note that the indexscan is actually *slower* than the seqscan so far as
the table access is concerned; if the table were big enough to not fit
in RAM, this would get very much worse.  So I'm not impressed with trying
to force the optimizer's hand as you've done here --- it might be a nice
plan now, but it's brittle.  See if a bigger work_mem improves matters
enough with the regular plan.
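
For example, something along these lines would test that for a single session (the 100MB figure is only a rough guess, per the ballpark above):

-- session-local experiment; adjust the value and re-test
SET work_mem = '100MB';
EXPLAIN ANALYZE SELECT count(DISTINCT email) FROM participants;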

> *I'm concerned about setting the enable_bitmapscan and seq_page_cost values
> because I'm not yet sure what the consequences are.  Can anyone enlighten
> me on the recommended way to speed up this query?*

Turning off enable_bitmapscan globally would be a seriously bad idea.
Changing the cost settings to these values globally might be all right;
it would amount to optimizing for all-in-memory cases, which might or
might not be a good idea for your situation.  For that matter, greatly
increasing work_mem globally is usually not thought to be smart either;
remember that it's a per-sort-operation setting so you may need to
provision a considerable multiple of the setting as physical RAM,
depending on how many queries you expect to run concurrently.  So all in
all you might be well advised to just set special values for this one
query, whichever solution approach you use.
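
For instance, wrapping the query in a transaction and using SET LOCAL keeps the overrides from leaking into the rest of the session (the values shown are only illustrative):

BEGIN;
-- these settings revert automatically at COMMIT/ROLLBACK
SET LOCAL work_mem = '100MB';
SET LOCAL seq_page_cost = 0.1;
SET LOCAL random_page_cost = 0.2;
SELECT count(DISTINCT email) FROM participants;
COMMIT;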

I doubt you need the "where email=email" hack, in any case.  That isn't
forcing the optimizer's decision in any meaningful fashion.

            regards, tom lane


Re: Slow Count-Distinct Query

From
Christopher Jackson
Date:

   Tom and Shaun,

   Thanks for the feedback.  This has been helpful.  It's worth noting that I was spiking this out on my local box with default memory settings.  I'll revisit this once we get our production box set up.  It's good to know what the best practices are around the enable_bitmapscan and seq_page_cost settings.  Also, it's good to know that my hack wasn't actually yielding anything.  I'll check back in once our production environment is up and running.  For what it's worth, we're using Heroku and we're thinking of starting with the Standard Tengu tier.  That gives us 1.7GB of RAM, so hopefully bumping up the work_mem setting shouldn't be a problem.  Does that make sense?

  Thanks for the help,
     Chris


Re: Slow Count-Distinct Query

From
bricklen
Date:

On Sun, Mar 30, 2014 at 12:45 PM, Christopher Jackson <crjackso@gmail.com> wrote:
  Hi all,

  tl;dr - How can I speed up my count-distinct query?  

Depending on how often you need to run that query and how important it is to you, and if you are willing to accept a performance hit on INSERT/UPDATE/DELETE of the participants table, you could create a summary table, populated by a trigger on INSERT/UPDATE/DELETE of participants, containing either just the count of unique email addresses or the list of unique email addresses.  Another option is to try out the new materialized views (http://www.postgresql.org/docs/current/static/sql-creatematerializedview.html) available in 9.3.
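
A minimal sketch of the materialized-view variant (the view name here is made up, and in 9.3 REFRESH takes an exclusive lock on the matview):

-- precompute the figure once, then refresh on whatever schedule suits you
CREATE MATERIALIZED VIEW distinct_email_count AS
  SELECT count(DISTINCT email) AS email_count
  FROM participants;

-- re-run whenever the count needs to be brought up to date
REFRESH MATERIALIZED VIEW distinct_email_count;

SELECT email_count FROM distinct_email_count;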

Re: Slow Count-Distinct Query

From
Christopher Jackson
Date:

    Hi Bricklen,

    Thanks for the feedback.  I'll play around with materialized views.  My understanding is they have to be manually triggered for refresh and there's an exclusive lock on the view while the refresh is taking place.  Is this your understanding as well?  I'm using PG 9.3.3.  If this is true, I'm curious what clever ways people have come up with to mitigate any issues with the lock.

   Thanks again,
      Chris


Re: Slow Count-Distinct Query

From
Michael Paquier
Date:
On Wed, Apr 2, 2014 at 1:22 PM, Christopher Jackson <crjackso@gmail.com> wrote:
>
>     Hi Bricklen,
>
>     Thanks for the feedback.  I'll play around with materialized views.  My
> understanding is they have to be manually triggered for refresh
Yep.

> and there's an exclusive lock on the view while the refresh is taking place.  Is this
> your understanding as well?
Re-yep.

> I'm using PG 9.3.3.  If this is true, I'm
> curious what clever ways people have come up with to mitigate any issues
> with the lock.
Kevin Grittner has implemented REFRESH MATERIALIZED VIEW CONCURRENTLY
in 9.4.  A unique index on the materialized view is also required for
the concurrent operation.  The benefit is that SELECTs on the matview
keep working during the refresh.
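
Roughly, on 9.4 that would look like the following (view and index names are made up; the CONCURRENTLY form does not exist in 9.3):

-- list the distinct emails so a unique index is possible
CREATE MATERIALIZED VIEW distinct_emails AS
  SELECT DISTINCT email FROM participants;

-- a unique index is required before a concurrent refresh
CREATE UNIQUE INDEX ON distinct_emails (email);

-- SELECTs against distinct_emails keep working while this runs
REFRESH MATERIALIZED VIEW CONCURRENTLY distinct_emails;

SELECT count(*) FROM distinct_emails;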
--
Michael


Re: Slow Count-Distinct Query

From
Christopher Jackson
Date:


    Hi Bricklen,

   Thanks again for the feedback.  The concurrent refresh sounds cool.  I just saw the 9.4 release is tentatively scheduled for later this year.  Do you know what people have been doing for view refreshes in the meantime?

   Thanks

