Обсуждение: Many joins: monthly summaries S-L--O--W
Hello all!
I've been working with PostgreSQL now for about a year and have really
enjoyed it. And I've been able to impress the boss with very simple
PHP/PostgreSQL web applications, so that's been great too!
Nearly all I've done has been very straightforward and basic: simple
selects from a couple of different tables joined by where clauses. But
now I'd like to do something a bit more complex: show sales per month,
current inventory, and expected duration of inventory at present sales
rates for each item we have. All in one relation :)
Here are the relevant relations
products (code TEXT PK, name TEXT)
orders (product_code TEXT FK products(id), date DATE, branch_id INTEGER
FK branches(id), qty INTEGER)
inventory (product_code INTEGER FK products(id), date DATE, qty INTEGER)
What I'm trying to make is a table with columns
product_code,
product_name,
jan03_qty,...,
half1_qty,
jul03_qty, ...
sep03_qty,
half2_qty,
total_qty,
inv_qty,
inv_date,
est_inv_qty,
months_remaining
where product_name is products(name), jan03_qty is the quantity of
sales in January, 2003, half1_qty is quantity of sales from January
through June, 2003, inv_qty is the latest inventory data we have,
inv_date is the date of that inventory data, est_inv_qty is the
estimated current inventory based on inv_qty and sales since inv_date,
and months_remaining is an estimate of how many months the estimated
inventory will last at average sales rates (calculated using the
average monthly sales for the previous months).
I've got something that works, but it's *extremely* slow. It takes
about 10 minutes running on a 500MHz G4 Mac with 1GB RAM, running OS
10.2.8 and PostgreSQL 7.3.3. Here's what I've done:
select
products.code as product_code,
products.name as product_name,
jan03.qty as jan03_qty,
feb03.qty as feb03_qty ...
inv.qty as inv_qty,
est_inv.qty as est_inv_qty,
months_remaining::numeric(8,1)
from products
left join (
select products.code as product_code, sum(qty) as qty
from orders, products
where products.code = orders.id and
date between '2003-01-01' and '2003-01-31'
group by product_code) as
jan03 on (jan03.product_code = products.code)
left join (
select products.code as product_code, sum(qty) as qty
from orders, products
where products.code = orders.id and
date between '2003-02-01' and '2003-02-28'
group by product_code) as
feb03 on (feb03.product_code = products.code)
left join
-- repeat through total_qty
total on (total.product_code = products.code)
left join (
-- this is where it get's hairy
select
est_inventory.product_code as product_code,
est_inventory.qty,
monthly_averages.monthly_average,
(est_inventory.qty/monthly_average)::numeric(10,1)
as months_remaining
from (
select
inventory.product_code as product_code,
inventory.qty - coalesce(orders.qty,0) as qty
from (
select product_code, date, qty
from current_inventory_view
) as
inventory
left outer join (
select orders.product_code as product_code,
sum(orders.qty) as qty
from (
select product_code, date, qty
from current_inventory_view
) as
inventory,
orders
where
orders.date > inventory.date and
orders.product_code = inventory.product_code
group by orders.product_code
) as
orders on (inventory.product_code = orders.product_code)
) as
est_inventory
left outer join (
select
product_code as product_code,
sum(qty)/ageinmonths(timestamp '9/30/2003',
timestamp '1/1/2003') as monthly_average
from orders
where date between '1/1/2003' and '9/30/2003'
group by product_code
) as
monthly_averages on
(est_inventory.product_code = monthly_averages.product_code)
where monthly_average > 0
) as
remaining on (remaining.product_code = products.code)
left join (
select distinct product_code, date, qty
from current_inventory_view order by date desc
) as
inventory on (inventory.product_code = products.code)
; -- finally
ageinmonths is an sql function that returns a double precision float,
the number of months between $1 and $2, defined as follows
select
12 * date_part ('year',age($1, $2)) +
date_part('month',age($1,$2)) +
date_part('day'), age($1, $2))/30;
current_inventory_view, showing the most recent inventory qty and date,
is defined as
SELECT inv.date, inv.product_code, inv.qty FROM inventory inv, (SELECT
max(inventory.date) AS date, inventory.product_code FROM inventory
GROUP BY inventory.product_code) curr_inv WHERE ((inv.date =
curr_inv.date) AND (inv.product_code = curr_inv.product_code));
orders is indexed on date and product_code (and branch_id, but that
shouldn't matter)
inventory is indexed on date
products is unindexed besides the order_pkey index that's
auto-generated.
I'm using left joins because not all items sell each month. Items that
have zero sales would fall out if I used WHERE clauses.
Now, I've done an EXPLAIN on the monstrosity and it comes to just over
3 pages printed at 8 pt on A3 landscape. It's much more than I can take
in. I've looked through the EXPLAIN documentation and am willing to
learn (which is good, because I definitely need to).
I tried a version of this, truncating the 'hairy' part. It definitely
ran faster, taking less than 5 minutes.
Also, this query requires more upkeep than I'd like. I modify it every
time I want to add a new month. It seems like I should be able to use a
custom function so I can just put in the date span I'd like to cover.
I've started to work on a function and have run into some problems (see
separate post: Custom function problems if you're interested )
Any suggestion would be appreciated. It seems like this should be
something that's easy, and I'm approaching it wrong. If I'm completely
on the wrong track, I'd love to know! Also, places I should/could look
for ways to accomplish this, that'd be great.
If you've made it to here, thanks for your perseverance! : )
Regards,
Michael Glaesemann
grzm myrealbox com
Micheal, > where product_name is products(name), jan03_qty is the quantity of > sales in January, 2003, half1_qty is quantity of sales from January > through June, 2003, inv_qty is the latest inventory data we have, > inv_date is the date of that inventory data, est_inv_qty is the > estimated current inventory based on inv_qty and sales since inv_date, > and months_remaining is an estimate of how many months the estimated > inventory will last at average sales rates (calculated using the > average monthly sales for the previous months). Well, I wouldn't recommend your left outer join approach. You're making the query do far too much work. There are a number of different ways to solve the "crosstab" problem, and the outer join method is only really good for small data sets. I'd suggest instead that you use the "aggregate grid" method: Construct a table like this, called month_xtab: month_no jan_ct feb_ct mar_ct .... half_ct half2_ct 1 1 0 0 1 0 2 0 1 0 1 0 3 0 0 1 1 0 ... 12 0 0 0 0 1 Then you can do monthly crosstabs like: SELECT item, SUM(no_sold * jan_ct) as jan_qty, SUM(no_sold * feb_ct) as feb_qty .... SUM (no_sold * half_ct) as half_qty, SUM(no_sold) as tot_qty FROM sales, month_xtab WHERE (extract(month from sales_date) = month_no and extract(year from sales_date) = 2003) This is much, much faster than the outer join method for large numbers of columns. For better performance, make an index on extract(month from sales_date). This doesn't solve your whole query problem, but it's a good start. This solution, and other tips, can be found in Joe Celko's "SQL for Smarties" -- Josh Berkus Aglio Database Solutions San Francisco
Michael Glaesmann <grzm@myrealbox.com> writes:
> [ a very messy query ]
It does seem like that is a big query with a small query struggling to
get out. I don't have any immediate suggestions, but I wonder whether
you wouldn't profit by reading something about how to do crosstabs and
statistics in SQL. Joe Celko's book "SQL For Smarties" is an invaluable
resource for hard problems in SQL (be sure to get the 2nd edition).
Also, if you decide that a crosstab would help, take a look at
the crosstab functions in contrib/tablefunc. (Celko's book only covers
SQL-standard solutions, not methods that depend on nonstandard features,
so he's at a disadvantage when covering crosstab methods.)
Also, I do have a suggestion for this:
> current_inventory_view, showing the most recent inventory qty and date,
> is defined as
> SELECT inv.date, inv.product_code, inv.qty FROM inventory inv, (SELECT
> max(inventory.date) AS date, inventory.product_code FROM inventory
> GROUP BY inventory.product_code) curr_inv WHERE ((inv.date =
> curr_inv.date) AND (inv.product_code = curr_inv.product_code));
If you don't mind using a Postgres-specific feature, you should be able
to make this view faster by using DISTINCT ON. Look at the "weather
reports" example in the SELECT reference page.
regards, tom lane
Functional index problems. (Was: Many joins: monthly summaries S-L--O--W)
От
Michael Glaesmann
Дата:
I've implemented the aggregate grid method, building a "xtab" table as
outlined. The orders relation has about 300K records, which isn't that
big from the numbers thrown around on these lists. The basic query
(just the monthly columns, none of the inventory mess) took about 2
minutes. That's going through the 300K records and creating monthly
totals. Something tells me there's room for improvement, so I'll keep
trying. (Adding the inventory joins slows it back down to about 10
minutes, so there's a lot of room for improvement there, but one thing
at a time, right?)
On Wednesday, Oct 22, 2003, at 01:48 Asia/Tokyo, Josh Berkus wrote:
<excerpt>I'd suggest instead that you use the "aggregate grid" method:
</excerpt><<snip/>
<excerpt>This is much, much faster than the outer join method for
large numbers of
columns. For better performance, make an index on extract(month from
sales_date).
</excerpt>
Searching for ways to improve performance, I tried to create a index
on the extract function, but for some reason I couldn't get it to
work. Following the documentation for CREATE INDEX and EXTRACT, I tried
CREATE INDEX dborders_extract_month_idx ON dborders (EXTRACT(MONTH
from date));
which gave me
ERROR: parser: parse error at or near "(" at character 61
I also tried
CREATE INDEX dborders_extract_month_idx ON dborders EXTRACT(MONTH from
date);
ERROR: parser: parse error at or near "EXTRACT" at character 53
and just for good measure, not that I thought it work
CREATE INDEX dborders_extract_month_idx on dborders EXTRACT(MONTH from
(date));
ERROR: parser: parse error at or near "EXTRACT" at character 53
What am I missing from the CREATE statement? It seems in line with the
documentation for CREATE INDEX:
<fontfamily><param>Courier</param>CREATE [ UNIQUE ] INDEX
<italic>index_name </italic>ON <italic>table </italic>[ USING
<italic>acc_method </italic>] ( <italic>func_name
</italic>(<italic>column </italic>[, ... ]) [ <italic>ops_name
</italic>] )
[ WHERE <italic>predicate </italic>]</fontfamily>
and the example given in 8.5 Functional Indexes of the User's Guide:
<fontfamily><param>Courier</param>CREATE INDEX test1_lower_col1_idx ON
test1 (lower(col1));
I know this isn't the solution to all my problems, but I am a bit
curious why my CREATE INDEX isn't working.
Further debugging attempts:
Try a different fuction.
CREATE INDEX dborders_date_trunc_idx on dborders
(date_trunc('month',date));
This failed with
ERROR: parser: parse error at or near "'month'" at character 62
I even renamed the 'date' column (type timestamp with timezone) to
'datetime' on the off chance that using and SQL key word (though not a
PostgreSQL key word) was causing some problem. Same errors.
What is that simple thing I'm overlooking? Any ideas what else I
should check? (I did a search on 'functional index' in the list
archives but kept getting timed out :(
Michael
</fontfamily>
I've implemented the aggregate grid method, building a "xtab" table as
outlined. The orders relation has about 300K records, which isn't that
big from the numbers thrown around on these lists. The basic query
(just the monthly columns, none of the inventory mess) took about 2
minutes. That's going through the 300K records and creating monthly
totals. Something tells me there's room for improvement, so I'll keep
trying. (Adding the inventory joins slows it back down to about 10
minutes, so there's a lot of room for improvement there, but one thing
at a time, right?)
On Wednesday, Oct 22, 2003, at 01:48 Asia/Tokyo, Josh Berkus wrote:
> I'd suggest instead that you use the "aggregate grid" method:
<snip/>
> This is much, much faster than the outer join method for large numbers
> of
> columns. For better performance, make an index on extract(month from
> sales_date).
Searching for ways to improve performance, I tried to create a index on
the extract function, but for some reason I couldn't get it to work.
Following the documentation for CREATE INDEX and EXTRACT, I tried
CREATE INDEX dborders_extract_month_idx ON dborders (EXTRACT(MONTH from
date));
which gave me
ERROR: parser: parse error at or near "(" at character 61
I also tried
CREATE INDEX dborders_extract_month_idx ON dborders EXTRACT(MONTH from
date);
ERROR: parser: parse error at or near "EXTRACT" at character 53
and just for good measure, not that I thought it work
CREATE INDEX dborders_extract_month_idx on dborders EXTRACT(MONTH from
(date));
ERROR: parser: parse error at or near "EXTRACT" at character 53
What am I missing from the CREATE statement? It seems in line with the
documentation for CREATE INDEX:
CREATE [ UNIQUE ] INDEX index_name ON table [ USING acc_method ] (
func_name (column [, ... ]) [ ops_name ] )
[ WHERE predicate ]
and the example given in 8.5 Functional Indexes of the User's Guide:
CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
I know this isn't the solution to all my problems, but I am a bit
curious why my CREATE INDEX isn't working.
Further debugging attempts:
Try a different fuction.
CREATE INDEX dborders_date_trunc_idx on dborders
(date_trunc('month',date));
This failed with
ERROR: parser: parse error at or near "'month'" at character 62
I even renamed the 'date' column (type timestamp with timezone) to
'datetime' on the off chance that using and SQL key word (though not a
PostgreSQL key word) was causing some problem. Same errors.
What is that simple thing I'm overlooking? Any ideas what else I should
check? (I did a search on 'functional index' in the list archives but
kept getting timed out :(
Michael
Michael Glaesmann <grzm@myrealbox.com> writes:
> Searching for ways to improve performance, I tried to create a index on
> the extract function, but for some reason I couldn't get it to work.
> Following the documentation for CREATE INDEX and EXTRACT, I tried
> CREATE INDEX dborders_extract_month_idx ON dborders (EXTRACT(MONTH from
> date));
> ERROR: parser: parse error at or near "(" at character 61
You can't do that in pre-7.4 releases; the syntax of a functional index
can only be "ON table (func(col1,col2,...))" --- that is, a simple,
standard-notation function applied to one or more columns of the table.
So to do this, you'd need to create an intermediate function along
the lines of "month_trunc(date)"; and you'd have to use it in your
queries as well as in the index definition.
7.4 is more flexible though --- it will take the above as long as you
put an extra set of parentheses in there...
regards, tom lane
Michael,
> What is that simple thing I'm overlooking? Any ideas what else I should
> check? (I did a search on 'functional index' in the list archives but
> kept getting timed out :(
Oh, sorry. There's an implementation issue with funcional indexes, where they
can't take parameters other than column names. So you need to do:
CREATE FUNCTION get_month (
TIMESTAMPTZ ) RETURNS INTEGER AS
' SELECT EXTRACT(MONTH from $1); '
LANGUAGE sql IMMUTABLE STRICT;
Then do
CREATE INDEX dborders_date_trunc_idx on dborders
(get_month(date));
--
-Josh Berkus
Aglio Database Solutions
San Francisco
Re: Functional index problems. (Was: Many joins: monthly summaries S-L--O--W)
От
Michael Glaesmann
Дата:
On Thursday, Oct 23, 2003, at 02:44 Asia/Tokyo, Tom Lane wrote:
> Michael Glaesmann <grzm@myrealbox.com> writes:
>> CREATE INDEX dborders_extract_month_idx ON dborders (EXTRACT(MONTH
>> from
>> date));
>> ERROR: parser: parse error at or near "(" at character 61
>
> You can't do that in pre-7.4 releases; the syntax of a functional index
> can only be "ON table (func(col1,col2,...))" --- that is, a simple,
> standard-notation function applied to one or more columns of the table.
> So to do this, you'd need to create an intermediate function along
> the lines of "month_trunc(date)"; and you'd have to use it in your
> queries as well as in the index definition.
On Thursday, Oct 23, 2003, at 02:53 Asia/Tokyo, Josh Berkus wrote:
> Oh, sorry. There's an implementation issue with funcional indexes,
> where they
> can't take parameters other than column names. So you need to do:
>
> CREATE FUNCTION get_month (
> TIMESTAMPTZ ) RETURNS INTEGER AS
> ' SELECT EXTRACT(MONTH from $1); '
> LANGUAGE sql IMMUTABLE STRICT;
Thanks, Tom and Josh! Added a type cast of the extract (which returns a
double precision) and it's all good.
Tom commented:
> 7.4 is more flexible though --- it will take the above as long as you
> put an extra set of parentheses in there...
I took a gander at the documentation for 7.4beta. I can tell it's been
reorganized. There's no longer a specific section on functional indexes
that I can see, though there is mention of it in the SQL CREATE INDEX
entry
<http://developer.postgresql.org/docs/postgres/sql-createindex.html>
The 7.3.2 documents I have say that there cannot be multicolumn
functional indexes, though there's no mention of this in the new
documentation. Does this mean this proscription has been lifted?
Thanks again for your help!
Michael
Michael Glaesmann <grzm@myrealbox.com> writes: > On Thursday, Oct 23, 2003, at 02:44 Asia/Tokyo, Tom Lane wrote: >> 7.4 is more flexible though --- it will take the above as long as you >> put an extra set of parentheses in there... > I took a gander at the documentation for 7.4beta. I can tell it's been > reorganized. There's no longer a specific section on functional indexes > that I can see, though there is mention of it in the SQL CREATE INDEX > entry > <http://developer.postgresql.org/docs/postgres/sql-createindex.html> > The 7.3.2 documents I have say that there cannot be multicolumn > functional indexes, though there's no mention of this in the new > documentation. Does this mean this proscription has been lifted? Yes. Any column of an index can now be an expression; the former functional-index capability is now just a special case of "expressional indexes". For syntactic reasons we had to require an extra pair of parens around expressions in the CREATE INDEX statement --- although it proved possible to not require these when the expression looks like a standard-syntax function call, thus providing backwards compatibility with the old functional-index syntax. See http://developer.postgresql.org/docs/postgres/indexes-expressional.html as well as the CREATE INDEX reference page. I'm not particularly happy with the phrase "expressional index", and would like to think of something else to use before the 7.4 docs go final. Any ideas? regards, tom lane
Michael, > <http://developer.postgresql.org/docs/postgres/sql-createindex.html> > The 7.3.2 documents I have say that there cannot be multicolumn > functional indexes, though there's no mention of this in the new > documentation. Does this mean this proscription has been lifted? Yes. -- -Josh Berkus Aglio Database Solutions San Francisco
Re: Functional index problems. (Was: Many joins: monthly summaries S-L--O--W)
От
Michael Glaesmann
Дата:
On Thursday, Oct 23, 2003, at 04:17 Asia/Tokyo, Tom Lane wrote: <excerpt>I'm not particularly happy with the phrase "expressional index", and would like to think of something else to use before the 7.4 docs go final. Any ideas? </excerpt> Though it might be nice to be creative and name it something like "Bill", for a defined language like SQL I think it's best to work within the framework already used to describe the language. Looking in the PostgreSQL docs (which I'm sure you've done), I find this definition of "value expression" <fontfamily><param>Lucida Grande</param><bigger>A value expression is one of the following: - A constant or literal value; see <color><param>0000,0000,6666</param>Section 1.1.2 </color>. - A column reference. - A positional parameter reference, in the body of a function declaration. - An operator invocation. - A function call. - An aggregate expression. - A type cast. - A scalar subquery. - Another value expression in parentheses, useful to group subexpressions and override precedence. Preceding this is a paragraph mentioning that value expressions are also called "scalar expressions" or just "expressions". I was surprised to find column references are included in this list, but thinking about it, a column has a scalar value, which is what any function or operator returns as well. According to the above description of "value expression", a normal column index is just a special case of the more general "expression" index. (Actually, I'm a bit confused with the definition of value expression. Does column reference here mean value in the same row (and same table)? That's the only way I can see getting a scalar value. If it means reference in the constraint meaning, it's a set of allowed values, rather than a specific one, which definitely wouldn't work for an index. Perhaps the use of the word "reference" here is inappropriate, though I haven't looked through the rest of the documentation to see how the term "column reference" is used. My bad, I know. I'll definitely accept criticism for not doing my homework thoroughly.) To me, an operator is just a special class of function that doesn't use the more general func(arg [, arg]) construction. 2 + 2 could just as easily be add(2,2) — and maybe it is already defined this way as well. I haven't checked. However, I can see that it might be useful to distinguish between functions and column references. One term I thought about was "derived index". Another was "evaluated index". But both of these terms seem to introduce unnecessary language. My final thought would be to continue using "functional index", noting that operators are a special class of function. Out of the above description of "value expression", I believe only functions and operators are allowed as expressions in the CREATE INDEX syntax, correct? (Besides referring to a column.) What do you think? Michael</bigger></fontfamily>On Thursday, Oct 23, 2003, at 04:17 Asia/Tokyo, Tom Lane wrote: > I'm not particularly happy with the phrase "expressional index", and > would like to think of something else to use before the 7.4 docs go > final. Any ideas? Though it might be nice to be creative and name it something like "Bill", for a defined language like SQL I think it's best to work within the framework already used to describe the language. Looking in the PostgreSQL docs (which I'm sure you've done), I find this definition of "value expression" A value expression is one of the following: - A constant or literal value; see Section 1.1.2 . - A column reference. - A positional parameter reference, in the body of a function declaration. - An operator invocation. - A function call. - An aggregate expression. - A type cast. - A scalar subquery. - Another value expression in parentheses, useful to group subexpressions and override precedence. Preceding this is a paragraph mentioning that value expressions are also called "scalar expressions" or just "expressions". I was surprised to find column references are included in this list, but thinking about it, a column has a scalar value, which is what any function or operator returns as well. According to the above description of "value expression", a normal column index is just a special case of the more general "expression" index. (Actually, I'm a bit confused with the definition of value expression. Does column reference here mean value in the same row (and same table)? That's the only way I can see getting a scalar value. If it means reference in the constraint meaning, it's a set of allowed values, rather than a specific one, which definitely wouldn't work for an index. Perhaps the use of the word "reference" here is inappropriate, though I haven't looked through the rest of the documentation to see how the term "column reference" is used. My bad, I know. I'll definitely accept criticism for not doing my homework thoroughly.) To me, an operator is just a special class of function that doesn't use the more general func(arg [, arg]) construction. 2 + 2 could just as easily be add(2,2) — and maybe it is already defined this way as well. I haven't checked. However, I can see that it might be useful to distinguish between functions and column references. One term I thought about was "derived index". Another was "evaluated index". But both of these terms seem to introduce unnecessary language. My final thought would be to continue using "functional index", noting that operators are a special class of function. Out of the above description of "value expression", I believe only functions and operators are allowed as expressions in the CREATE INDEX syntax, correct? (Besides referring to a column.) What do you think? Michael
Michael Glaesmann <grzm@myrealbox.com> writes:
> My final thought would be to continue using "functional index", noting
> that operators are a special class of function. Out of the above
> description of "value expression", I believe only functions and
> operators are allowed as expressions in the CREATE INDEX syntax,
> correct? (Besides referring to a column.)
Actually, you can use any scalar-valued expression that does not contain
a sub-SELECT or aggregate function. (Aggregates would be meaningless
--- what are you aggregating over? The prohibition of sub-SELECTs is
partly because we don't do planning for index expressions, which could
be fixed if anyone were sufficiently motivated, and partly because it
seems quite unlikely that the result of a SELECT could reasonably be
considered immutable. We do *not* want to get into updating indexes for
reasons other than updates of the indexed row.)
I see your point that "functional index" is still a valid description,
but I'm inclined to bow to Peter's position that it has another meaning
that could distract people. Presently I'm going to go with "expression
index" unless someone can provide a better choice.
regards, tom lane