Re: Measure Theoretic Data Types in Postgresql

Поиск
Список
Период
Сортировка
От Aaron Sheldon
Тема Re: Measure Theoretic Data Types in Postgresql
Дата
Msg-id CADyahXbNVZkYv7+r3z-76Fce-xY8Yjeucrc1v_4bZ_=dOfigSQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Measure Theoretic Data Types in Postgresql  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Ответы Re: Measure Theoretic Data Types in Postgresql
Список pgsql-hackers
So the key algorithmic inefficient is the inner join on the generated series. Worst case scenario this compares every range to every date in the series, which for m ranges and n dates yields O(m*n) operations. The analysts in my shop currently write queries like this for billions of records against thousands of dates and then go and take 8 hour coffee breaks.

However, by realizing that the bounds on the ranges have a linear ordering one can speed this up to 0(m) using windowing functions on common table expressions.

So what I am proposing is formalizing this optimization into a class of data types, that will hide the implementation details.



On Fri, Oct 12, 2012 at 1:48 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 11.10.2012 07:37, Aaron Sheldon wrote:
This would allow for a succinct syntax to do calculations such as
finding the daily unique patient count given the intervals of their
attendance in particular programs; a computation I encounter
routinely as a statistician for a health services provider.

Hmm. It's easy to get the count of unique patients on a particular date with something like:

select count(distinct patient) from attendance where interval && '2012-10-12'::date

I guess what you're after is to get that count for a range of days, in one query, so that the result looks something like this:

   date    |  patients
-----------+------------
2012-10-05 |       20
2012-10-06 |       24
2012-10-07 |       30
2012-10-08 |       29

The way I think of that problem is that you need to join the dates you're interested in with the attendance table.

select date, count (distinct patientid)
from attendance
inner join (
  select '2012-10-04'::date + a AS date from generate_series(1,20) a
) dates on interval @> date
group by date;
    date    | count
------------+-------
 2012-10-05 |    11
 2012-10-06 |    27
 2012-10-07 |    47
 2012-10-08 |    63
 2012-10-09 |    83
 2012-10-10 |    95
 2012-10-11 |    80
 2012-10-12 |    60
 2012-10-13 |    35
 2012-10-14 |    13
(10 rows)

I created the test table for that with:

create table attendance (patientid int4 , interval daterange)
insert into attendance select id, daterange('2012-10-05'::date + (random()*5)::int4, '2012-10-10'::date + (random()*5)::int4) from generate_series(1,100) id;


So, I think the current range types already cover that use case pretty well. I can't imagine how the proposed measure theoretic concepts would make that simpler. Can you give some more complicated problem, perhaps, that the proposed measure theoretic concepts would make simpler than the current tools?

- Heikki



--
Aaron Sheldon

#67 Westedge Townhouses 
5019 46 Ave, SW
Calgary AB, T3E 6R1

(h) 1.403.453.6316
(c) 1.403.710.9357
aaron.sheldon@gmail.com

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: proposal - assign result of query to psql variable
Следующее
От: Pavel Stehule
Дата:
Сообщение: patch: assign result of query to psql variable