Обсуждение: Avoid sorting when doing an array_agg

Поиск
Список
Период
Сортировка

Avoid sorting when doing an array_agg

От
Alexis Woo
Дата:
I have a users table which contains ~70 million rows that looks like this:

   Column    |       Type        |
-------------+-------------------+
 id          | integer           |
 first_name  | character varying |
 last_name   | character varying |
 category_id | integer           |
Indexes:
    "users_id_idx" btree (id)
    "users_category_id_first_name_last_name_idx" btree (category_id, first_name, last_name)

I'm trying to retrieve the ids for each (first_name, last_name) couple for one specific category_id.
The query that I'm currently doing is the following:

select array_agg(id)
from users
where category_id = 5432
group by first_name, last_name;

For which the explain analyze output is the following:

 GroupAggregate  (cost=618461.35..626719.42 rows=26881 width=19) (actual time=1683.139..2613.386 rows=102943 loops=1)
   Group Key: first_name, last_name
   ->  Sort  (cost=618461.35..620441.86 rows=792206 width=19) (actual time=1683.116..2368.904 rows=849428 loops=1)
         Sort Key: first_name, last_name
         Sort Method: external merge  Disk: 25304kB
         ->  Bitmap Heap Scan on users  (cost=26844.16..524595.92 rows=792206 width=19) (actual time=86.046..229.469 rows=849428 loops=1)
               Recheck Cond: (category_id = 5432)
               Heap Blocks: exact=7938
               ->  Bitmap Index Scan on users_category_id_first_name_last_name_idx  (cost=0.00..26646.11 rows=792206 width=0) (actual time=85.006..85.006 rows=849428 loops=1)
                     Index Cond: (category_id = 5432)

What seems to greatly decrease the performance of the query is the "Sort Method: external merge Disk: 7526kB."

Is it possible to aggregate the ids without doing a sort ?
If not, what other options, apart from increasing the work_mem, do I have ?

Thanks,

Alexis

Re: Avoid sorting when doing an array_agg

От
Kiriakos Georgiou
Дата:

The array_agg() has nothing to do with it.  It’s the group by.

Without knowing what you are conceptually trying to accomplish, I can’t say much.

On my test 9.4.10 db, a similar example does a HashAggregate, so no sorting (google HashAggregate vs GroupAggregate).  But still it’s an expensive query because of all the I/O.

If I wanted to instantly have the user ids for a specific first, last name and category combo, I’d maintain a summary table via an insert trigger on the users table.

 

Kiriakos

 

 

From: <pgsql-general-owner@postgresql.org> on behalf of Alexis Woo <awoo2611@gmail.com>
Date: Friday, December 2, 2016 at 12:25 PM
To: <pgsql-general@postgresql.org>
Subject: [GENERAL] Avoid sorting when doing an array_agg

 

I have a users table which contains ~70 million rows that looks like this:

 

   Column    |       Type        |

-------------+-------------------+

 id          | integer           |

 first_name  | character varying |

 last_name   | character varying |

 category_id | integer           |

Indexes:

    "users_id_idx" btree (id)

    "users_category_id_first_name_last_name_idx" btree (category_id, first_name, last_name)

 

I'm trying to retrieve the ids for each (first_name, last_name) couple for one specific category_id.

The query that I'm currently doing is the following:

 

select array_agg(id)

from users

where category_id = 5432

group by first_name, last_name;

 

For which the explain analyze output is the following:

 

 GroupAggregate  (cost=618461.35..626719.42 rows=26881 width=19) (actual time=1683.139..2613.386 rows=102943 loops=1)

   Group Key: first_name, last_name

   ->  Sort  (cost=618461.35..620441.86 rows=792206 width=19) (actual time=1683.116..2368.904 rows=849428 loops=1)

         Sort Key: first_name, last_name

         Sort Method: external merge  Disk: 25304kB

         ->  Bitmap Heap Scan on users  (cost=26844.16..524595.92 rows=792206 width=19) (actual time=86.046..229.469 rows=849428 loops=1)

               Recheck Cond: (category_id = 5432)

               Heap Blocks: exact=7938

               ->  Bitmap Index Scan on users_category_id_first_name_last_name_idx  (cost=0.00..26646.11 rows=792206 width=0) (actual time=85.006..85.006 rows=849428 loops=1)

                     Index Cond: (category_id = 5432)

 

What seems to greatly decrease the performance of the query is the "Sort Method: external merge Disk: 7526kB."

 

Is it possible to aggregate the ids without doing a sort ?

If not, what other options, apart from increasing the work_mem, do I have ?

Thanks,

Alexis

 

Re: Avoid sorting when doing an array_agg

От
Tomas Vondra
Дата:
On Sat, 2016-12-03 at 13:08 -0500, Kiriakos Georgiou wrote:
> The array_agg() has nothing to do with it.  It’s the group by.
> Without knowing what you are conceptually trying to accomplish, I
> can’t say much.

It *IS* caused by array_agg(). PostgreSQL can only do HashAggregate
when everything fits into memory, and in this case has to deal with
aggregate states of unknown size, so assumes each state is 1kB IIRC.

Per the plan the group by is expected to produce ~27k groups, so needs
about 30MB for the HashAggregate.  

> On my test 9.4.10 db, a similar example does a HashAggregate, so no
> sorting (google HashAggregate vs GroupAggregate).  But still it’s an
> expensive query because of all the I/O.

The query does almost no I/O, actually. The bitmap heap scan takes only
~230ms, which is not bad considering it produces ~1M rows. The
expensive part here seems to be the sort, but I doubt it's because of
I/O because it only produces temporary files that likely stay in RAM
anyway.

So the sort is probably slow because of CPU, as it compares strings. In
some locales that may be very expensive - not sure which locale is used
in this case, as it was not mentioned. 

> If I wanted to instantly have the user ids for a specific first, last
> name and category combo, I’d maintain a summary table via an insert
> trigger on the users table.
>  

Maybe. The question is whether it'll be a net win - maintenance of the
summary table will not be for free, especially with arrays of ids.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Avoid sorting when doing an array_agg

От
Peter Geoghegan
Дата:
On Sat, Dec 3, 2016 at 5:20 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So the sort is probably slow because of CPU, as it compares strings. In
> some locales that may be very expensive - not sure which locale is used
> in this case, as it was not mentioned.

I wonder what it would take to teach the optimizer to consider the
possibility of a "collation strength reduction". In other words, for
aggregates that perform a sort (or for aggregates that rely on the
presence of a sort node without there being some other dependency on
the sort node), it should be possible for the optimizer to determine
that it would be just fine to use the C locale, since the user isn't
entitled to assume anything about the exact sort order. There are of
course cases where this can make a huge difference.

--
Peter Geoghegan


Re: Avoid sorting when doing an array_agg

От
Pavel Stehule
Дата:


2016-12-04 23:12 GMT+01:00 Peter Geoghegan <pg@bowt.ie>:
On Sat, Dec 3, 2016 at 5:20 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So the sort is probably slow because of CPU, as it compares strings. In
> some locales that may be very expensive - not sure which locale is used
> in this case, as it was not mentioned.

I wonder what it would take to teach the optimizer to consider the
possibility of a "collation strength reduction". In other words, for
aggregates that perform a sort (or for aggregates that rely on the
presence of a sort node without there being some other dependency on
the sort node), it should be possible for the optimizer to determine
that it would be just fine to use the C locale, since the user isn't
entitled to assume anything about the exact sort order. There are of
course cases where this can make a huge difference.

it is pretty good idea.

Regards

Pavel


 

--
Peter Geoghegan


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Avoid sorting when doing an array_agg

От
Tom Lane
Дата:
Peter Geoghegan <pg@bowt.ie> writes:
> I wonder what it would take to teach the optimizer to consider the
> possibility of a "collation strength reduction". In other words, for
> aggregates that perform a sort (or for aggregates that rely on the
> presence of a sort node without there being some other dependency on
> the sort node), it should be possible for the optimizer to determine
> that it would be just fine to use the C locale, since the user isn't
> entitled to assume anything about the exact sort order. There are of
> course cases where this can make a huge difference.

IMO the way to handle this would be to consider both paths that use the
straight sort order and paths that use COLLATE "C" ordering.  I think
the key structural limitation that makes it not straightforward is that
the query_planner() API supports only one target ordering
(root->query_pathkeys).  I've had a bee in my bonnet for awhile about
replacing that with a list of potentially-useful target orderings, but
haven't got round to making it happen.

Of course, we would also have to teach cost_sort or someplace near there
that non-C sorting is much more expensive than C-collation sorting.  Not
sure about exactly how to set that up without it being a kluge.

A related problem is that if you have "GROUP BY x,y" and no particular
ORDER BY requirement, you could sort by either x,y or y,x before the
GroupAgg.  This would matter if, say, there was an index matching one
but not the other.  Right now we're very stupid and only consider x,y,
but if there were room to consider more than one set of target pathkeys
it would be fairly simple to make that better.

            regards, tom lane


Re: Avoid sorting when doing an array_agg

От
Peter Geoghegan
Дата:
On Sun, Dec 4, 2016 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Of course, we would also have to teach cost_sort or someplace near there
> that non-C sorting is much more expensive than C-collation sorting.  Not
> sure about exactly how to set that up without it being a kluge.

We've talked about that before, in the context of parallel query. At
the 2014 developer meeting, IIRC.

> A related problem is that if you have "GROUP BY x,y" and no particular
> ORDER BY requirement, you could sort by either x,y or y,x before the
> GroupAgg.  This would matter if, say, there was an index matching one
> but not the other.  Right now we're very stupid and only consider x,y,
> but if there were room to consider more than one set of target pathkeys
> it would be fairly simple to make that better.

That sounds valuable, especially because it seems natural to make the
leading group-on var the least selective within a GROUP BY; having a
matching index that you can thereby use might be less common than that
in practice, unless and until the partial sort patch is committed.

I will tend to write "GROUP BY country, province, city" -- never
"GROUP BY city, province, country". I speak a language that is written
left-to-right, but it would be the opposite way around in both
directions if I spoke a language written right-to-left, I bet. Same
difference. This might be a very prevalent habit. In general, a
tuplesort will be faster with a high cardinality leading attribute, so
this habit works against tuplesort. (Assuming a leading attribute of
pass-by-value type, or with abbreviated key support.)

--
Peter Geoghegan