Re: Parallel Aggregates for string_agg and array_agg
От | David Rowley |
---|---|
Тема | Re: Parallel Aggregates for string_agg and array_agg |
Дата | |
Msg-id | CAKJS1f9noir4YyR=XzGFDoTwian-dOKyLPaVFMCiC8C1K-F83w@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Parallel Aggregates for string_agg and array_agg (Tomer Praizler <tomer.praizler@gmail.com>) |
Список | pgsql-novice |
On Thu, 14 Mar 2019 at 12:43, Tomer Praizler <tomer.praizler@gmail.com> wrote: > my query looks something like this: > > SELECT x.timestamp, x.port, x.super_name, max(x.timestamp) AS last_seen, coalesce(array_length(array_merge_agg(x.ids),1), 0) AS my_count, coalesce(array_length(array_agg(DISTINCT x.name), 1), 0)AS names, sum(x.count) AS final_count > FROM x > GROUP BY x.timestamp, x.port, x.super_name > ORDER BY sum(x.count) > > This result in a plan without parallel execution because of the array_agg on a string field. when I remove it the queryplanner spawns a parallel execution plan. > It reduces the time from 5 minutes to around 1 minute which is also a lot. (if there is any idea on how to optimize fartherplease help:) btw, hardware resources is not a problem) I think most of that speedup will be coming from removing the aggregate that has a DISTINCT rather than removing it because it can't be parallelised. You could find that out by seeing how fast it is without the array_agg after having SET max_parallel_workers_per_gather TO 0; FWIW, even with the patch you mentioned earlier, this query would not use parallel aggregates due to that DISTINCT. All other aggregates don't store the individual aggregated values, so ordinarily, it's not possible for the aggregate combine phase to combine the aggregate states from parallel workers and ignore the values that are duplicated between workers. array_agg does happen to store the individual values, but the parallel aggregate infrastructure has no support in the combine phase to combine and eliminate the duplicates. I don't think it would be entirely impossible to add it, but, out of the standard set of aggregate functions, it would only be string_agg and array_agg along with some xml and json aggs that could use that, so it might not be worth the trouble. The reason your query will be taking so long with the array_agg(DISTINCT ...) is that internally the executor performs a sort of the entire results so that it can perform the duplicate elimination by checking if the current value is the same as the previous. I think the best thing that can be done to speed that up is the thing that Tom hints at in the final paragraph in [1]. This would have the planner consider choosing a plan that provides pre-sorted input to the aggregate node. In your case that would be x.timestamp, x.port, x.super_name, x.name. If there was an index supporting that then no sorting would need to be done for the aggregation phase at all. Anyway, that's all future stuff that does not exist today. For today, I think the only way you'll get an improvement is to somehow precalculate the DISTINCT x.name. Alternatively, you could consider some normalisation and have x.name be an int column and put the actual names into another table, then just look those up after aggregation. This would only help in the sense that sorting an int column is faster and uses less memory than sorting a text column. [1] https://www.postgresql.org/message-id/22068.1522103326%40sss.pgh.pa.us -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
В списке pgsql-novice по дате отправления:
Следующее
От: Andreas KretschmerДата:
Сообщение: Re: Elegant way to insert of some value in some tables in shortinstruction