Обсуждение: Tricky math in PSQL

Поиск
Список
Период
Сортировка

Tricky math in PSQL

От
Kip Warner
Дата:
Hey list,

I am having problems trying to figure out the best way to implement a
complicated function in PostgreSQL. I have tried my best to articulate
my problem here:

    http://pastebin.com/zwWmYMd0

To understand it mathematically is not so hard, but to express it in
PSQL syntax is hard for me.

I am aware that the views and functions I have may not be the most
efficient. I am open to suggestions for those too.

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com


Вложения

Re: Tricky math in PSQL

От
Kevin Grittner
Дата:
On Fri, Apr 22, 2016 at 1:44 PM, Kip Warner <kip@thevertigo.com> wrote:

> I am having problems trying to figure out the best way to implement a
> complicated function in PostgreSQL. I have tried my best to articulate
> my problem here:
>
>     http://pastebin.com/zwWmYMd0

Please include such information in the post, either directly or as
an attachment.  Messages on these lists go back over 18 years at
this point, and while you did flag that paste as "never delete",
who's to say whether pastebin will be around to keep the promise
that the URL will be valid 18 years from now?

> To understand it mathematically is not so hard, but to express it in
> PSQL syntax is hard for me.
>
> I am aware that the views and functions I have may not be the most
> efficient. I am open to suggestions for those too.

People can best help you if you provide a self-contained test case
-- on that actually runs from an empty database.  You will get more
responses, and are more likely to get to a good solution, if you
simplify the example as far as possible while still showing the
basic problem -- without ellipses or other non-executable visual
cues.

You may want to review this page and post again:

https://wiki.postgresql.org/wiki/SlowQueryQuestions

Also, keep in mind that plpgsql is not the only language available
within PostgreSQL; for statistics you might, for example, consider
plr.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tricky math in PSQL

От
Kip Warner
Дата:
On Fri, 2016-04-22 at 14:37 -0500, Kevin Grittner wrote:
> Please include such information in the post, either directly or as
> an attachment.  Messages on these lists go back over 18 years at
> this point, and while you did flag that paste as "never delete",
> who's to say whether pastebin will be around to keep the promise
> that the URL will be valid 18 years from now?

Hey Kevin,

Noted. In general you are right. Because these queries are a bit long
and sometimes wrap in ugly ways in folks' MUAs making them harder to
read, you'll have to forgive me for this one.

> People can best help you if you provide a self-contained test case
> -- on that actually runs from an empty database.  You will get more
> responses, and are more likely to get to a good solution, if you
> simplify the example as far as possible while still showing the
> basic problem -- without ellipses or other non-executable visual
> cues.

Ok, it took me a while to figure out how to populate the tables with
sample random data, but I believe I have a complete self-contained
example that executes.

    http://pastebin.com/zwWmYMd0

The two tables statistics and statistics_child are populated with a
large number of rows.

Thank you for your help, Kevin.

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com


Вложения

Re: Tricky math in PSQL

От
Merlin Moncure
Дата:
On Mon, Apr 25, 2016 at 8:46 PM, Kip Warner <kip@thevertigo.com> wrote:
> On Fri, 2016-04-22 at 14:37 -0500, Kevin Grittner wrote:
>> Please include such information in the post, either directly or as
>> an attachment.  Messages on these lists go back over 18 years at
>> this point, and while you did flag that paste as "never delete",
>> who's to say whether pastebin will be around to keep the promise
>> that the URL will be valid 18 years from now?
>
> Hey Kevin,
>
> Noted. In general you are right. Because these queries are a bit long
> and sometimes wrap in ugly ways in folks' MUAs making them harder to
> read, you'll have to forgive me for this one.

No, we don't :-).    Anyways, the relevant portion is here.

-- I would like to know the most efficient way of calculating the following,
--  preferably within the DBMS, to avoid having to shift intermediate
--  calculations back and forth between the client application. It is easy to
--  write and understand mathematically, but translating into PSQL is tricky...

--total_diff(id_a, id_b) =
--    (statistic_constant * statistic_diff(id_a, id_b)) +
--    (statistic_child_constant *
--        sigma(
--            sort_ordinal=1,
--            sort_ordinal=
--                min(COUNT(statistics_child.id=id_a),
COUNT(statistics_child.id=id_b)),
--            statistic_child_diff(id_a, id_b, sort_ordinal)) / constantG))

-- ...the sigma sums all statistic_child_diff() for the two id's to compare
-- for sort_ordinal 1,2,3,...to N where N is the lesser of the number of rows in
-- either statistics_child with id_a or id_b.

I'm not going to work out the whole problem for you, but you probably
want to start looking in the direction custom aggregate function.
Most people are not aware of this but custom aggregates can be applied
over a window.  The state can be a composite type and thus hold
arbitrarily complex information that you can maintain as you walk
through the set (this works great for time series and things like
that).

Barring that, if you are still looking for solutions to this type of
problem in the database, I'd point you in the direction of pl/r.  The
learning curve is somewhat steep but rewards mastery.

CREATE AGGREGATE RateChange(NUMERIC, NUMERIC, INT)
(
  SFUNC = AggRateChange,
  STYPE = NUMERIC
);

CREATE OR REPLACE FUNCTION AggRateChange(
  State NUMERIC,
  Seed NUMERIC,
  RateChange NUMERIC,
  NDigits INT) RETURNS NUMERIC AS
$$
  SELECT ROUND(COALESCE($1, $2) * (1.0 + $3), NDigits);
$$ LANGUAGE SQL IMMUTABLE;

select ratechange(100, (random() * 0.05)::numeric, 3) over(order by s)
from generate_series(1,10) s;
 ratechange
────────────
    101.524
    104.570
    104.674
    106.599
    111.685
    114.583
    120.174
    125.561
    130.291


merlin


Re: Tricky math in PSQL

От
Kip Warner
Дата:
On Tue, 2016-04-26 at 09:13 -0500, Merlin Moncure wrote:
> I'm not going to work out the whole problem for you, but you probably
> want to start looking in the direction custom aggregate function.
> Most people are not aware of this but custom aggregates can be
> applied over a window.  The state can be a composite type and thus
> hold arbitrarily complex information that you can maintain as you
> walk through the set (this works great for time series and things
> like that).
>
> Barring that, if you are still looking for solutions to this type of
> problem in the database, I'd point you in the direction of pl/r.  The
> learning curve is somewhat steep but rewards mastery.
>
> CREATE AGGREGATE RateChange(NUMERIC, NUMERIC, INT)
> (
>   SFUNC = AggRateChange,
>   STYPE = NUMERIC
> );
>
> CREATE OR REPLACE FUNCTION AggRateChange(
>   State NUMERIC,
>   Seed NUMERIC,
>   RateChange NUMERIC,
>   NDigits INT) RETURNS NUMERIC AS
> $$
>   SELECT ROUND(COALESCE($1, $2) * (1.0 + $3), NDigits);
> $$ LANGUAGE SQL IMMUTABLE;
>
> select ratechange(100, (random() * 0.05)::numeric, 3) over(order by
> s)
> from generate_series(1,10) s;
>  ratechange
> ────────────
>     101.524
>     104.570
>     104.674
>     106.599
>     111.685
>     114.583
>     120.174
>     125.561
>     130.291

Hey Merlin,

Thank you for your thoughtful suggestions. I spent the better part of
the day wrestling with some refactored schema and queries with a great
deal of help from the kind folks in #postgresql. This is what I've got
now, albeit without custom aggregate functions:

    http://pastebin.com/zwWmYMd0

As you can see, it does appear to work, but it's a problem of doing so
efficiently now.  The solution I have around line 192 to 208 performs
total_diff() for a given id against all others, but with some constants
removed. It does work, but for a statistic table of 1,000,000 rows and
a statistics_child table of 101,948,612, the query takes approximately
4m 11s on my machine.

I'd like to improve this. I wonder if providing indexes on some of the
columns in the two views would be helpful?

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com


Вложения