Обсуждение: crosstab/repivot...any interest?

Поиск
Список
Период
Сортировка

crosstab/repivot...any interest?

От
Merlin Moncure
Дата:
Hackers,

We have a strong need to make a variant to the crosstab interface so
that data that is pivoted one way would be sent through a crosstab
like function so that it would be pivoted another way.  For example,
if you had

row 0: a1, a2, a3, k1, c1, c2, ...
row 1: a1, a2, a3, k2, c1, c2, ...
row 2: a1, a2, a3, k3, c1, c2, ...
...

where 'a' columns are uninteresting attribute columns, 'k' is the
dimension we want to pivot on, and c1->cN would be stacked vertically,
so that we'd end up with,
row 0: a1, a2, a3, c1, k1, k2, ...
row 1: a1, a2, a3, c2, k1, k2, ...
row 2: a1, a2, a3, c3, k1, k2, ...

There are various SQL level approaches to this but they tend to be
imperformant with large datasets so that I think a crosstab-like C
implementation ought to be able to do better (or at least I hope so)
since you have to cross product rows and columns in such a way that
you can get a clean join.  Cribbing from tablefunc.c I don't think
this is a terrible challenge to do in hash table style.

Questions on the table:
*) Has anyone done anything like this or know of any current implementations?
*) Would there be any interest in expanding tablefunc along these lines?

thanks in advance,
merlin


Re: crosstab/repivot...any interest?

От
David Fetter
Дата:
On Fri, Jan 25, 2019 at 02:21:55PM -0600, Merlin Moncure wrote:
> Hackers,
> 
> We have a strong need to make a variant to the crosstab interface so
> that data that is pivoted one way would be sent through a crosstab
> like function so that it would be pivoted another way.  For example,
> if you had
> 
> row 0: a1, a2, a3, k1, c1, c2, ...
> row 1: a1, a2, a3, k2, c1, c2, ...
> row 2: a1, a2, a3, k3, c1, c2, ...
> ...
> 
> where 'a' columns are uninteresting attribute columns, 'k' is the
> dimension we want to pivot on, and c1->cN would be stacked vertically,
> so that we'd end up with,
> row 0: a1, a2, a3, c1, k1, k2, ...
> row 1: a1, a2, a3, c2, k1, k2, ...
> row 2: a1, a2, a3, c3, k1, k2, ...
> 
> There are various SQL level approaches to this but they tend to be
> imperformant with large datasets so that I think a crosstab-like C
> implementation ought to be able to do better (or at least I hope so)
> since you have to cross product rows and columns in such a way that
> you can get a clean join.  Cribbing from tablefunc.c I don't think
> this is a terrible challenge to do in hash table style.
> 
> Questions on the table:
> *) Has anyone done anything like this or know of any current implementations?
> *) Would there be any interest in expanding tablefunc along these lines?

There's something in SQL:2016 that I read as crosstabs, or at least as
enabling crosstabs.
https://www.iso.org/standard/69776.html

If we're going to put work into crosstabs, it seems to me that the
"we" needs to be the project as a whole, and the work should be, to
the extent reasonable, toward standard compliance.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: crosstab/repivot...any interest?

От
Merlin Moncure
Дата:
On Fri, Jan 25, 2019 at 3:16 PM David Fetter <david@fetter.org> wrote:
>
> On Fri, Jan 25, 2019 at 02:21:55PM -0600, Merlin Moncure wrote:
> > Hackers,
> >
> > We have a strong need to make a variant to the crosstab interface so
> > that data that is pivoted one way would be sent through a crosstab
> > like function so that it would be pivoted another way.  For example,
> > if you had
> >
> > row 0: a1, a2, a3, k1, c1, c2, ...
> > row 1: a1, a2, a3, k2, c1, c2, ...
> > row 2: a1, a2, a3, k3, c1, c2, ...
> > ...
> >
> > where 'a' columns are uninteresting attribute columns, 'k' is the
> > dimension we want to pivot on, and c1->cN would be stacked vertically,
> > so that we'd end up with,
> > row 0: a1, a2, a3, c1, k1, k2, ...
> > row 1: a1, a2, a3, c2, k1, k2, ...
> > row 2: a1, a2, a3, c3, k1, k2, ...
> >
> > There are various SQL level approaches to this but they tend to be
> > imperformant with large datasets so that I think a crosstab-like C
> > implementation ought to be able to do better (or at least I hope so)
> > since you have to cross product rows and columns in such a way that
> > you can get a clean join.  Cribbing from tablefunc.c I don't think
> > this is a terrible challenge to do in hash table style.
> >
> > Questions on the table:
> > *) Has anyone done anything like this or know of any current implementations?
> > *) Would there be any interest in expanding tablefunc along these lines?
>
> There's something in SQL:2016 that I read as crosstabs, or at least as
> enabling crosstabs.
> https://www.iso.org/standard/69776.html
>
> If we're going to put work into crosstabs, it seems to me that the
> "we" needs to be the project as a whole, and the work should be, to
> the extent reasonable, toward standard compliance.

Interesting.  Do you see that the spec (it makes my brain hurt) can
handle that kind of repivoting?

merlin


Re: crosstab/repivot...any interest?

От
David Fetter
Дата:
On Fri, Jan 25, 2019 at 04:31:00PM -0600, Merlin Moncure wrote:
> On Fri, Jan 25, 2019 at 3:16 PM David Fetter <david@fetter.org> wrote:
> >
> > On Fri, Jan 25, 2019 at 02:21:55PM -0600, Merlin Moncure wrote:
> > > Hackers,
> > >
> > > We have a strong need to make a variant to the crosstab interface so
> > > that data that is pivoted one way would be sent through a crosstab
> > > like function so that it would be pivoted another way.  For example,
> > > if you had
> > >
> > > row 0: a1, a2, a3, k1, c1, c2, ...
> > > row 1: a1, a2, a3, k2, c1, c2, ...
> > > row 2: a1, a2, a3, k3, c1, c2, ...
> > > ...
> > >
> > > where 'a' columns are uninteresting attribute columns, 'k' is the
> > > dimension we want to pivot on, and c1->cN would be stacked vertically,
> > > so that we'd end up with,
> > > row 0: a1, a2, a3, c1, k1, k2, ...
> > > row 1: a1, a2, a3, c2, k1, k2, ...
> > > row 2: a1, a2, a3, c3, k1, k2, ...
> > >
> > > There are various SQL level approaches to this but they tend to be
> > > imperformant with large datasets so that I think a crosstab-like C
> > > implementation ought to be able to do better (or at least I hope so)
> > > since you have to cross product rows and columns in such a way that
> > > you can get a clean join.  Cribbing from tablefunc.c I don't think
> > > this is a terrible challenge to do in hash table style.
> > >
> > > Questions on the table:
> > > *) Has anyone done anything like this or know of any current implementations?
> > > *) Would there be any interest in expanding tablefunc along these lines?
> >
> > There's something in SQL:2016 that I read as crosstabs, or at least as
> > enabling crosstabs.
> > https://www.iso.org/standard/69776.html
> >
> > If we're going to put work into crosstabs, it seems to me that the
> > "we" needs to be the project as a whole, and the work should be, to
> > the extent reasonable, toward standard compliance.
> 
> Interesting.  Do you see that the spec (it makes my brain hurt) can
> handle that kind of repivoting?

I believe the constructs can nest and/or refer to each other, so yes.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: crosstab/repivot...any interest?

От
Morris de Oryx
Дата:
Hello, I'm not a C coder and can't help....but I love cross-tab/pivot tables. They're the best, and just fantastic for preparing data to feed into various analysis tools. The tablefunc module is helpful, but a bit awkward to use (as I remember it.)

From a user's point of view, I high-performance cross-tab generator would be just fantastic. 

As I understand it, this is what's involved in a pivot:

1. Identify rows that should be grouped (consolidated.)
2. Distinguish the value that identifies each derived column.
3. Distinguish the value that identifies each row-column value.
4. Collapse the rows, build the columns, and populate the 'cells' with data.

In an ideal world, you would be able to perform different grouping operations. Such as count, sum, avg, etc.

If there's a way to do this in a system-wide and standards-pointing way, so much the better.

Apologies if I'm violating list etiquette by jumping in here. I've been lurking on several Postgres lists for a bit and picking up interesting details every day. If I've been Unintentionally and Cluelessly Off, I'm find with being told.


On Sat, Jan 26, 2019 at 10:49 AM David Fetter <david@fetter.org> wrote:
On Fri, Jan 25, 2019 at 04:31:00PM -0600, Merlin Moncure wrote:
> On Fri, Jan 25, 2019 at 3:16 PM David Fetter <david@fetter.org> wrote:
> >
> > On Fri, Jan 25, 2019 at 02:21:55PM -0600, Merlin Moncure wrote:
> > > Hackers,
> > >
> > > We have a strong need to make a variant to the crosstab interface so
> > > that data that is pivoted one way would be sent through a crosstab
> > > like function so that it would be pivoted another way.  For example,
> > > if you had
> > >
> > > row 0: a1, a2, a3, k1, c1, c2, ...
> > > row 1: a1, a2, a3, k2, c1, c2, ...
> > > row 2: a1, a2, a3, k3, c1, c2, ...
> > > ...
> > >
> > > where 'a' columns are uninteresting attribute columns, 'k' is the
> > > dimension we want to pivot on, and c1->cN would be stacked vertically,
> > > so that we'd end up with,
> > > row 0: a1, a2, a3, c1, k1, k2, ...
> > > row 1: a1, a2, a3, c2, k1, k2, ...
> > > row 2: a1, a2, a3, c3, k1, k2, ...
> > >
> > > There are various SQL level approaches to this but they tend to be
> > > imperformant with large datasets so that I think a crosstab-like C
> > > implementation ought to be able to do better (or at least I hope so)
> > > since you have to cross product rows and columns in such a way that
> > > you can get a clean join.  Cribbing from tablefunc.c I don't think
> > > this is a terrible challenge to do in hash table style.
> > >
> > > Questions on the table:
> > > *) Has anyone done anything like this or know of any current implementations?
> > > *) Would there be any interest in expanding tablefunc along these lines?
> >
> > There's something in SQL:2016 that I read as crosstabs, or at least as
> > enabling crosstabs.
> > https://www.iso.org/standard/69776.html
> >
> > If we're going to put work into crosstabs, it seems to me that the
> > "we" needs to be the project as a whole, and the work should be, to
> > the extent reasonable, toward standard compliance.
>
> Interesting.  Do you see that the spec (it makes my brain hurt) can
> handle that kind of repivoting?

I believe the constructs can nest and/or refer to each other, so yes.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

crosstab/repivot...any interest?

От
Merlin Moncure
Дата:
On Fri, Jan 25, 2019 at 9:14 PM Morris de Oryx <morrisdeoryx@gmail.com> wrote:
>
> Hello, I'm not a C coder and can't help....but I love cross-tab/pivot tables. They're the best, and just fantastic for preparing data to feed into various analysis tools. The tablefunc module is helpful, but a bit awkward to use (as I remember it.)
>
> From a user's point of view, I high-performance cross-tab generator would be just fantastic.
>
> As I understand it, this is what's involved in a pivot:
>
> 1. Identify rows that should be grouped (consolidated.)
> 2. Distinguish the value that identifies each derived column.
> 3. Distinguish the value that identifies each row-column value.
> 4. Collapse the rows, build the columns, and populate the 'cells' with data.
>
> In an ideal world, you would be able to perform different grouping operations. Such as count, sum, avg, etc.
>
> If there's a way to do this in a system-wide and standards-pointing way, so much the better.
>
> Apologies if I'm violating list etiquette by jumping in here. I've been lurking on several Postgres lists for a bit and picking up interesting details every day. If I've been Unintentionally and Cluelessly Off, I'm find with being told.

No worries, sir! Apologies on the late reply.  I've made some headway on this item.  Waiting for postgres to implement the SQL standard pivoting (regardless if it implements the cases I need) is not an option for my personal case. I can't use the SQL approach either as it's very slow and imposing some scaling limits that need to work around in the short run.

My strategy is to borrow [steal] from crosstab_hash and make a new version called repivot which takes an arleady pivoted data set and repivots it against an identified column.   Most of the code can be shared with tablefunc so ideally this could live as an amendment to that extension.  That might not happen though, so I'm going to package it as a separate extension (removing the majority of tablefunc that I don't need) and submit it to this group for consideration.

If we punt, it'll end up as a private extension or live the life of an Orphan in pgxn.  If there's some interest here, we can consider a new contrib extension (which I personally rate very unlikely) or recasting as an extra routine to tablefunc.  Any way we slice it, huge thanks to Joe Conway&co for giving us such an awesome function to work with all these years (not to mention the strategic plr language).  SRF crosstab() is still somewhat baroque, but it still fills a niche that nothing else implements.

The interface I'm looking at is:
SELECT repivot(
  query TEXT,
  static_attributes INT,  /* number of static attributes that are unchanged around key; we need this in our usages */
  attribute_query  TEXT); /* query that draws up the pivoted attribute list */

The driving query is expected to return 0+ static attributes which are essentially simply pasted to the output. The next two columns are the key column and the pivoting column.   So if you had three attributes, the input set would be:

a1, a2, a3, k1, p, v1...vn

Where the coordinates v and p would exchange.  I need to get this done quickly and so am trying to avoid more abstracted designs (maybe multi part keys should be supported through...this is big limitation of crosstab albeit with some obnoxious work arounds).

merlin

Re: crosstab/repivot...any interest?

От
Joe Conway
Дата:
On 2/25/19 8:34 PM, Merlin Moncure wrote:
> No worries, sir! Apologies on the late reply.  I've made some headway on
> this item.  Waiting for postgres to implement the SQL standard pivoting
> (regardless if it implements the cases I need) is not an option for my
> personal case. I can't use the SQL approach either as it's very slow and
> imposing some scaling limits that need to work around in the short run.
>
> My strategy is to borrow [steal] from crosstab_hash and make a new
> version called repivot which takes an arleady pivoted data set and
> repivots it against an identified column.   Most of the code can be
> shared with tablefunc so ideally this could live as an amendment to that
> extension.  That might not happen though, so I'm going to package it as
> a separate extension (removing the majority of tablefunc that I don't
> need) and submit it to this group for consideration.


I can't promise there will be consensus to add to tablefunc, but I am
not opposed and will be happy to try to help you make that happen to the
extent I can find the spare cycles.

> If we punt, it'll end up as a private extension or live the life of an
> Orphan in pgxn.  If there's some interest here, we can consider a new
> contrib extension (which I personally rate very unlikely) or recasting
> as an extra routine to tablefunc.  Any way we slice it, huge thanks to
> Joe Conway&co for giving us such an awesome function to work with all
> these years (not to mention the strategic plr language).  SRF crosstab()
> is still somewhat baroque, but it still fills a niche that nothing else
> implements.
>
> The interface I'm looking at is:
> SELECT repivot(
>   query TEXT,
>   static_attributes INT,  /* number of static attributes that are
> unchanged around key; we need this in our usages */
>   attribute_query  TEXT); /* query that draws up the pivoted attribute
> list */
>
> The driving query is expected to return 0+ static attributes which are
> essentially simply pasted to the output. The next two columns are the
> key column and the pivoting column.   So if you had three attributes,
> the input set would be:
>
> a1, a2, a3, k1, p, v1...vn
>
> Where the coordinates v and p would exchange.  I need to get this done
> quickly and so am trying to avoid more abstracted designs (maybe multi
> part keys should be supported through...this is big limitation of
> crosstab albeit with some obnoxious work arounds).

Perhaps not enough coffee yet, but I am not sure I fully grok the design
here. A fully fleshed out example would be useful.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Вложения

Re: crosstab/repivot...any interest?

От
Merlin Moncure
Дата:
On Tue, Feb 26, 2019 at 8:31 AM Joe Conway <mail@joeconway.com> wrote:
> On 2/25/19 8:34 PM, Merlin Moncure wrote:
> > The interface I'm looking at is:
> > SELECT repivot(
> >   query TEXT,
> >   static_attributes INT,  /* number of static attributes that are
> > unchanged around key; we need this in our usages */
> >   attribute_query  TEXT); /* query that draws up the pivoted attribute
> > list */
> >
> > The driving query is expected to return 0+ static attributes which are
> > essentially simply pasted to the output. The next two columns are the
> > key column and the pivoting column.   So if you had three attributes,
> > the input set would be:
> >
> > a1, a2, a3, k1, p, v1...vn
> >
> > Where the coordinates v and p would exchange.  I need to get this done
> > quickly and so am trying to avoid more abstracted designs (maybe multi
> > part keys should be supported through...this is big limitation of
> > crosstab albeit with some obnoxious work arounds).
>
> Perhaps not enough coffee yet, but I am not sure I fully grok the design
> here. A fully fleshed out example would be useful.

Thanks for your interest.

A lot of our application data is organized in columnar type formats.
Typically, the analytical reports on the back of the database are
organized withe output columns being a KPI or a unit of time.

KPI Columns:
Country,Make,Model,TimeSlice,Sales,Units
US,Ford,Fusion,Y2018Q1,165MM$,250k
US,Ford,Fusion,Y2018Q2,172MM$,261k
US,Ford,Fusion,Y2018Q3,183MM$,268k
...

Time Columns:
Country,Make,Mode, KPI,Y2018Q1,Y2018Q2,Y2018Q3
US,Ford,Fusion,Y2018Q1,Sales,165MM$,172MM$,183MM$
US,Ford,Fusion,Y2018Q2,Units,250k,261k,268k

SELECT repivot(
  <query pulling KPI format>,
  1, /* only one static attribute */
  2, /* number of key columns, only needed if multi part keys are supported */
  <query developing list of time columns>);

In this example supporting multi-part, the repivot column is the 4th,
assumed to be attributes first, keys second, pivot column third, data
block last.  Multi column pivots might be considered but I don't need
them and that's a significant expansion in scope, so I'm avoiding that
consideration.

What we need to do is convert from the first format above (which is
how data is materialized on table) to the second format.  Just like as
within crosstab, if the key column(s) are ordered into the function we
can exploit that structure for an efficient implementation.

'Ford' and 'Fusion' are key columns; 'US' is uninteresting attribute
column (it is unambiguously represented by 'Ford', 'Fusion') and so
would be simply be copied from input to output set.

Our data is materialized in KPI style (which is pretty typical) but we
have large analytical reports that want it with columns representing
units of time.  This is farily common in the industry IMO.

There are various pure SQL approaches to do this but they tend to
force you to build out the X,Y columns and then join it all back
together; this can spiral out of control quite quickly from a
performance standpoint.  A crosstab style crawling cursor over ordered
set ought to get the job done very efficiently.  tablefunc doesn't
support multi part keys today, and the workaround is that you have so
stuff a key into a formatted string and than parse it out for
downstream joining, some needs of having to do that joining might
hopefully be eliminated by allowing the attribute columns to be copied
through.

merlin