Обсуждение: query patterns for multipass aggregating

Поиск
Список
Период
Сортировка

query patterns for multipass aggregating

От
Rob Nikander
Дата:
Hi,

I’ve run into this pattern a few times, and I usually get a little confused. I’m wondering if there are some common
solutionsor techniques. 

Simplified example:

I have tables `items`, `colors`, and `images`. Items have many colors, and many images.

I want a query to list items, with their colors and images. Each result row is an item, so the colors and images must
beaggregated into arrays or json.  

If there were only one other table, it’s a simple join and group…

    select items.*, array_agg(color_name)
    from items join colors on items.id = colors.item_id
    group by items.id

Now to get images too, my first try uses a CTE…

    // Tables:
    // items (id, name, foo)
    // colors (item_id, color_name, color_foo)
    // images (item_id, image_file, image_foo)

    with items_with_colors as (
      // This is the same query as above
      select items.*, array_agg(colors.color_name) as color_names
      from items
      join colors on items.id = colors.item_id
      group by items.id
    )
    // Same idea repeated but now joining to images table
    select items.*, array_agg(images.image_file) as image_files
    from items_with_colors items
    join images on items.id = images.item_id
    group by items.id, items.name, items.foo, items.color_names;  // mmm :(

One first problem is on the last line. I have to list out all the columns. In my real situation there are many more.
Listingthem is only an inconvenience, but I’m more worried that it is internally comparing all the columns, when really
itcould compare only `items.id`. 

So… are there some other techniques that usually (or might) work better? It seems like the kind of thing that could run
inparallel, if I wrote it right.  

Rob



Re: query patterns for multipass aggregating

От
Andrew Gierth
Дата:
>>>>> "Rob" == Rob Nikander <rob.nikander@gmail.com> writes:

 Rob> I want a query to list items, with their colors and images. Each
 Rob> result row is an item, so the colors and images must be aggregated
 Rob> into arrays or json.

 Rob> If there were only one other table, it’s a simple join and group…

 Rob>     select items.*, array_agg(color_name)
 Rob>     from items join colors on items.id = colors.item_id
 Rob>     group by items.id

Method 1:

select items.*, c.colors, i.images
  from items
       left join (select item_id, array_agg(color_name) as colors
                    from colors
                   group by item_id) c
         on c.item_id=items.id
       left join (select item_id, array_agg(image) as images
                    from images
                   group by item_id) i
         on i.item_id=items.id;

Method 2:

select items.*, c.colors, i.images
  from items
       left join lateral (select array_agg(c0.color_name) as colors
                            from colors c0
                           where c0.item_id=items.id) c
         on true
       left join lateral (select array_agg(i0.image) as images
                            from images i0
                           where i0.item_id=items.id) i
         on true;

Unfortunately, the planner isn't smart enough yet to know that these two
are equivalent, so they generate different sets of possible query plans.
Method 1 gets plans that work well if the entire items table is being
selected, since it will read the whole of the images and colors tables
in one go, and it will also get plans that work well for reading a
_single_ item selected by WHERE items.id=? because equivalence-class
processing will propagate a copy of that condition down to below the
grouping clauses. It will _not_ get a good plan for reading any other
small subset of items (e.g. selected by other criteria); for this you
need method 2, which in turn doesn't get very good plans when you fetch
the whole items table.

Don't be tempted to use CTEs for the subqueries in either plan; that'll
only make it much worse.

--
Andrew (irc:RhodiumToad)


Re: query patterns for multipass aggregating

От
Andrew Gierth
Дата:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> Unfortunately, the planner isn't smart enough yet to know that
 Andrew> these two are equivalent,

oops, I edited the second one before posting in a way that made them not
be equivalent: adding a "group by x0.item_id" in both subqueries in
method 2 makes them equivalent again. Without that, the result differs
slightly if there are no matching color or image rows (NULL vs. empty
array).

-- 
Andrew (irc:RhodiumToad)


Re: query patterns for multipass aggregating

От
Rob Nikander
Дата:

> On Nov 11, 2018, at 12:54 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> …

Thank you that is very helpful.

Could the CTE’s theoretically be optimized in a future version of PG, to work like the subqueries? I like to use them
togive names to intermediate results, but I’ll stay away from them for now. 

Rob

Re: query patterns for multipass aggregating

От
Ondřej Bouda
Дата:
Dne 11.11.2018 v 17:20 Rob Nikander napsal(a):
 > I have tables `items`, `colors`, and `images`. Items have many 
colors, and many images.
 >
 > I want a query to list items, with their colors and images. Each 
result row is an item, so the colors and images must be aggregated into 
arrays or json.

What about subqueries?

SELECT
     items.*,
     (SELECT array_agg(color_name) FROM colors WHERE item_id = items.id) 
AS color_names,
     (SELECT array_agg(image_file) FROM images WHERE item_id = items.id) 
AS image_files
FROM items

According to my experience, not only the code is readable (no questions 
about how many rows are there for each item), but it also leads to a 
better query plan.


Dne 11.11.2018 v 19:23 Rob Nikander napsal(a):
> Could the CTE’s theoretically be optimized in a future version of PG, to work like the subqueries? I like to use them
togive names to intermediate results, but I’ll stay away from them for now.
 

It is on the todo list (for quite some time already):
https://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/
Until then, using CTEs in situations as yours leads to poor query plans 
as Postgres will compute the whole CTE separately (i.e., all rows 
involved) and only then picks some of them to the inner table.

Regards,
Ondřej Bouda


Re: query patterns for multipass aggregating

От
Andrew Gierth
Дата:
>>>>> "Ondřej" == Ondřej Bouda <obouda@email.cz> writes:

 Ondřej> What about subqueries?

 Ondřej> SELECT
 Ondřej>     items.*,
 Ondřej>     (SELECT array_agg(color_name) FROM colors WHERE item_id =
 Ondřej> items.id) AS color_names,
 Ondřej>     (SELECT array_agg(image_file) FROM images WHERE item_id =
 Ondřej> items.id) AS image_files
 Ondřej> FROM items

 Ondřej> According to my experience, not only the code is readable (no
 Ondřej> questions about how many rows are there for each item), but it
 Ondřej> also leads to a better query plan.

This is (generally speaking) no improvement over the LATERAL method I
showed, and is less flexible (for example it's very painful to return
more than one value from the subqueries).

--
Andrew (irc:RhodiumToad)