Discussion: Tricky math in PSQL
Hey list,

I am having problems trying to figure out the best way to implement a complicated function in PostgreSQL. I have tried my best to articulate my problem here:

http://pastebin.com/zwWmYMd0

To understand it mathematically is not so hard, but to express it in PSQL syntax is hard for me.

I am aware that the views and functions I have may not be the most efficient. I am open to suggestions for those too.

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com
Attachments
On Fri, Apr 22, 2016 at 1:44 PM, Kip Warner <kip@thevertigo.com> wrote:
> I am having problems trying to figure out the best way to implement a
> complicated function in PostgreSQL. I have tried my best to articulate
> my problem here:
>
> http://pastebin.com/zwWmYMd0

Please include such information in the post, either directly or as an attachment. Messages on these lists go back over 18 years at this point, and while you did flag that paste as "never delete", who's to say whether pastebin will be around to keep the promise that the URL will be valid 18 years from now?

> To understand it mathematically is not so hard, but to express it in
> PSQL syntax is hard for me.
>
> I am aware that the views and functions I have may not be the most
> efficient. I am open to suggestions for those too.

People can best help you if you provide a self-contained test case -- one that actually runs from an empty database. You will get more responses, and are more likely to get to a good solution, if you simplify the example as far as possible while still showing the basic problem -- without ellipses or other non-executable visual cues. You may want to review this page and post again:

https://wiki.postgresql.org/wiki/SlowQueryQuestions

Also, keep in mind that plpgsql is not the only language available within PostgreSQL; for statistics you might, for example, consider pl/r.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2016-04-22 at 14:37 -0500, Kevin Grittner wrote:
> Please include such information in the post, either directly or as
> an attachment. Messages on these lists go back over 18 years at
> this point, and while you did flag that paste as "never delete",
> who's to say whether pastebin will be around to keep the promise
> that the URL will be valid 18 years from now?

Hey Kevin,

Noted. In general you are right. Because these queries are a bit long and sometimes wrap in ugly ways in folks' MUAs, making them harder to read, you'll have to forgive me for this one.

> People can best help you if you provide a self-contained test case
> -- one that actually runs from an empty database. You will get more
> responses, and are more likely to get to a good solution, if you
> simplify the example as far as possible while still showing the
> basic problem -- without ellipses or other non-executable visual
> cues.

Ok, it took me a while to figure out how to populate the tables with sample random data, but I believe I have a complete self-contained example that executes:

http://pastebin.com/zwWmYMd0

The two tables, statistics and statistics_child, are populated with a large number of rows.

Thank you for your help, Kevin.

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com
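[For readers without access to the paste: a minimal sketch of the "populate with sample random data" step, using generate_series() and random(). The table and column names here are assumptions inferred from the pseudocode later in this thread, not the actual schema from the pastebin link.]

```sql
-- Hypothetical schema, sketched only to show the generate_series()/random()
-- technique for building a self-contained test case.
CREATE TABLE statistics (
    id        bigint  PRIMARY KEY,
    statistic numeric NOT NULL
);

CREATE TABLE statistics_child (
    id              bigint  REFERENCES statistics (id),
    sort_ordinal    integer NOT NULL,
    statistic_child numeric NOT NULL,
    PRIMARY KEY (id, sort_ordinal)
);

-- 1,000 parent rows with random values.
INSERT INTO statistics (id, statistic)
SELECT s, random()::numeric
FROM generate_series(1, 1000) AS s;

-- A varying number (1..10) of child rows per parent; LATERAL makes the
-- series depend on each parent row.
INSERT INTO statistics_child (id, sort_ordinal, statistic_child)
SELECT p.id, c, random()::numeric
FROM statistics AS p
CROSS JOIN LATERAL generate_series(1, 1 + (p.id % 10)::int) AS c;
```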
Attachments
On Mon, Apr 25, 2016 at 8:46 PM, Kip Warner <kip@thevertigo.com> wrote:
> On Fri, 2016-04-22 at 14:37 -0500, Kevin Grittner wrote:
>> Please include such information in the post, either directly or as
>> an attachment. Messages on these lists go back over 18 years at
>> this point, and while you did flag that paste as "never delete",
>> who's to say whether pastebin will be around to keep the promise
>> that the URL will be valid 18 years from now?
>
> Hey Kevin,
>
> Noted. In general you are right. Because these queries are a bit long
> and sometimes wrap in ugly ways in folks' MUAs making them harder to
> read, you'll have to forgive me for this one.

No, we don't :-).

Anyways, the relevant portion is here:

-- I would like to know the most efficient way of calculating the following,
-- preferably within the DBMS, to avoid having to shift intermediate
-- calculations back and forth between the client application. It is easy to
-- write and understand mathematically, but translating into PSQL is tricky...
--
-- total_diff(id_a, id_b) =
--     (statistic_constant * statistic_diff(id_a, id_b)) +
--     (statistic_child_constant *
--         sigma(
--             sort_ordinal=1,
--             sort_ordinal=
--                 min(COUNT(statistics_child.id=id_a), COUNT(statistics_child.id=id_b)),
--             statistic_child_diff(id_a, id_b, sort_ordinal)) / constantG)
--
-- ...the sigma sums all statistic_child_diff() for the two ids to compare
-- for sort_ordinal 1,2,3,...,N where N is the lesser of the number of rows in
-- statistics_child with id_a or id_b.

I'm not going to work out the whole problem for you, but you probably want to start looking in the direction of a custom aggregate function. Most people are not aware of this, but custom aggregates can be applied over a window. The state can be a composite type and thus hold arbitrarily complex information that you can maintain as you walk through the set (this works great for time series and things like that).
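[One way to read the pseudocode above as plain SQL, sketched here with hedges: all table, column, and function names (statistic_diff(), statistic_child_diff(), and the constants, written as functions for the sketch) come from the pseudocode and are assumed to exist in the real schema. The sigma becomes a SUM over child rows of both ids joined on sort_ordinal, so only ordinals present for *both* ids contribute -- which is exactly the "lesser of the two row counts" bound.]

```sql
-- Sketch only: assumes statistic_diff(), statistic_child_diff(), and the
-- constants exist in the real schema. Joining the child table to itself
-- on sort_ordinal restricts the sum to ordinals 1..N where N is the lesser
-- of the two ids' child-row counts.
CREATE OR REPLACE FUNCTION total_diff(id_a bigint, id_b bigint)
RETURNS numeric AS
$$
    SELECT (statistic_constant() * statistic_diff(id_a, id_b)) +
           (statistic_child_constant() *
               COALESCE(SUM(statistic_child_diff(id_a, id_b, a.sort_ordinal)), 0)
               / constant_g())
    FROM statistics_child AS a
    JOIN statistics_child AS b USING (sort_ordinal)
    WHERE a.id = id_a
      AND b.id = id_b;
$$ LANGUAGE SQL STABLE;
```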
Barring that, if you are still looking for solutions to this type of problem in the database, I'd point you in the direction of pl/r. The learning curve is somewhat steep but rewards mastery.

CREATE AGGREGATE RateChange(NUMERIC, NUMERIC, INT)
(
    SFUNC = AggRateChange,
    STYPE = NUMERIC
);

CREATE OR REPLACE FUNCTION AggRateChange(
    State NUMERIC,
    Seed NUMERIC,
    RateChange NUMERIC,
    NDigits INT) RETURNS NUMERIC AS
$$
    SELECT ROUND(COALESCE($1, $2) * (1.0 + $3), NDigits);
$$ LANGUAGE SQL IMMUTABLE;

select ratechange(100, (random() * 0.05)::numeric, 3) over(order by s)
from generate_series(1,10) s;
 ratechange
────────────
    101.524
    104.570
    104.674
    106.599
    111.685
    114.583
    120.174
    125.561
    130.291

merlin
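[A sketch of the composite-state variant Merlin mentions, since the RateChange example above gets by with a plain NUMERIC state. All names here are hypothetical, not from the thread: the state type carries a running sum plus a row count, and the final function turns them into a mean.]

```sql
-- Sketch: an aggregate whose state is a composite type, tracking a running
-- sum together with a row count so the final function can return a mean.
CREATE TYPE sum_count_state AS (total numeric, n bigint);

CREATE OR REPLACE FUNCTION sum_count_step(s sum_count_state, v numeric)
RETURNS sum_count_state AS
$$
    SELECT (COALESCE(s.total, 0) + v, COALESCE(s.n, 0) + 1)::sum_count_state;
$$ LANGUAGE SQL IMMUTABLE;

CREATE OR REPLACE FUNCTION sum_count_final(s sum_count_state)
RETURNS numeric AS
$$
    SELECT CASE WHEN s.n > 0 THEN s.total / s.n END;
$$ LANGUAGE SQL IMMUTABLE;

CREATE AGGREGATE my_mean(numeric)
(
    SFUNC     = sum_count_step,
    STYPE     = sum_count_state,
    FINALFUNC = sum_count_final,
    INITCOND  = '(0,0)'
);

-- And, as noted above, a custom aggregate can be applied over a window:
SELECT my_mean(s::numeric) OVER (ORDER BY s)
FROM generate_series(1, 5) AS s;
```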
On Tue, 2016-04-26 at 09:13 -0500, Merlin Moncure wrote:
> I'm not going to work out the whole problem for you, but you probably
> want to start looking in the direction of a custom aggregate function.
> Most people are not aware of this but custom aggregates can be
> applied over a window. The state can be a composite type and thus
> hold arbitrarily complex information that you can maintain as you
> walk through the set (this works great for time series and things
> like that).
>
> Barring that, if you are still looking for solutions to this type of
> problem in the database, I'd point you in the direction of pl/r. The
> learning curve is somewhat steep but rewards mastery.
>
> CREATE AGGREGATE RateChange(NUMERIC, NUMERIC, INT)
> (
>     SFUNC = AggRateChange,
>     STYPE = NUMERIC
> );
>
> CREATE OR REPLACE FUNCTION AggRateChange(
>     State NUMERIC,
>     Seed NUMERIC,
>     RateChange NUMERIC,
>     NDigits INT) RETURNS NUMERIC AS
> $$
>     SELECT ROUND(COALESCE($1, $2) * (1.0 + $3), NDigits);
> $$ LANGUAGE SQL IMMUTABLE;
>
> select ratechange(100, (random() * 0.05)::numeric, 3) over(order by s)
> from generate_series(1,10) s;
>  ratechange
> ────────────
>     101.524
>     104.570
>     104.674
>     106.599
>     111.685
>     114.583
>     120.174
>     125.561
>     130.291

Hey Merlin,

Thank you for your thoughtful suggestions. I spent the better part of the day wrestling with some refactored schema and queries, with a great deal of help from the kind folks in #postgresql. This is what I've got now, albeit without custom aggregate functions:

http://pastebin.com/zwWmYMd0

As you can see, it does appear to work, but it's now a problem of doing so efficiently. The solution I have around lines 192 to 208 performs total_diff() for a given id against all others, but with some constants removed. It does work, but for a statistics table of 1,000,000 rows and a statistics_child table of 101,948,612 rows, the query takes approximately 4m 11s on my machine. I'd like to improve this.
I wonder if providing indexes on some of the columns in the two views would be helpful?

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com
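[Worth noting on the indexing question: plain views in PostgreSQL cannot carry indexes themselves; an index has to go on the underlying table the view reads from, or on a materialized view. A sketch, with table and column names assumed from the pseudocode earlier in the thread:]

```sql
-- Indexes go on the base tables that the views read from. If the child
-- table is scanned by id and joined on sort_ordinal, a composite index
-- covering both columns is the usual first step (names are assumptions
-- based on the pseudocode in this thread).
CREATE INDEX ON statistics_child (id, sort_ordinal);

-- Alternatively, a materialized view CAN be indexed, at the cost of
-- having to refresh it whenever the underlying data changes:
CREATE MATERIALIZED VIEW statistics_summary AS
SELECT id, COUNT(*) AS child_rows
FROM statistics_child
GROUP BY id;

CREATE INDEX ON statistics_summary (id);

REFRESH MATERIALIZED VIEW statistics_summary;
```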