Обсуждение: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

Поиск
Список
Период
Сортировка

BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

От
arnaud.mouronval@gmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      10256
Logged by:          Arnaud Mouronval
Email address:      arnaud.mouronval@gmail.com
PostgreSQL version: 9.3.4
Operating system:   Windows 8.1
Description:

I discovered a problem while using a window that used an ORDER BY clause,
and using this window with ROW_NUMBER() and COUNT(*) at the same time.

Here is a short SQL script to replicate it :

DROP TABLE IF EXISTS tmp_count_window_bug_data;
CREATE TABLE tmp_count_window_bug_data (c1 character varying(8), c2
character varying(8));
INSERT INTO tmp_count_window_bug_data (c1, c2)
VALUES
    ('A', 'AA'),
    ('A', 'AB'),
    ('B', 'BA'),
    ('B', 'BB'),
    ('B', 'BC'),
    ('B', 'BC'),
    ('B', 'BD');

SELECT
    c1,
    COUNT(*) OVER(PARTITION BY c1),
    COUNT(*) OVER(PARTITION BY c1 ORDER BY c2)
FROM tmp_count_window_bug_data;

Result on my machine :
"A";2;1
"A";2;2
"B";5;1
"B";5;2
"B";5;4
"B";5;4
"B";5;5

I was expecting to get the same values in the last 2 columns.
As you can see, the third column looks much more like a RANK() (except
RANK() would have answered 3 instead of 4 for 2 lines).
arnaud.mouronval@gmail.com writes:
> I discovered a problem while using a window that used an ORDER BY clause,
> and using this window with ROW_NUMBER() and COUNT(*) at the same time.

> Here is a short SQL script to replicate it :

> DROP TABLE IF EXISTS tmp_count_window_bug_data;
> CREATE TABLE tmp_count_window_bug_data (c1 character varying(8), c2
> character varying(8));
> INSERT INTO tmp_count_window_bug_data (c1, c2)
> VALUES
>     ('A', 'AA'),
>     ('A', 'AB'),
>     ('B', 'BA'),
>     ('B', 'BB'),
>     ('B', 'BC'),
>     ('B', 'BC'),
>     ('B', 'BD');

> SELECT
>     c1,
>     COUNT(*) OVER(PARTITION BY c1),
>     COUNT(*) OVER(PARTITION BY c1 ORDER BY c2)
> FROM tmp_count_window_bug_data;

> Result on my machine :
> "A";2;1
> "A";2;2
> "B";5;1
> "B";5;2
> "B";5;4
> "B";5;4
> "B";5;5

This looks correct to me.  Keep in mind that the default window framing
clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which really
means "all rows up to and including the last sort peer of the current row".
With no ORDER BY clause, all rows in the partition are peers, so you get a
count covering the whole partition no matter which row you are on.  With
an ORDER BY, you get a count of only the rows up to the current row (or
its last peer, when there are duplicate values in the ORDER BY columns).

> I was expecting to get the same values in the last 2 columns.

For that you'd want RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWING.

            regards, tom lane
Tom Lane-2 wrote
> arnaud.mouronval@

>  writes:
>> I discovered a problem while using a window that used an ORDER BY clause,
>> and using this window with ROW_NUMBER() and COUNT(*) at the same time.
>
> This looks correct to me.  Keep in mind that the default window framing
> clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which really
> means "all rows up to and including the last sort peer of the current
> row".
> With no ORDER BY clause, all rows in the partition are peers, so you get a
> count covering the whole partition no matter which row you are on.  With
> an ORDER BY, you get a count of only the rows up to the current row (or
> its last peer, when there are duplicate values in the ORDER BY columns).
>
>> I was expecting to get the same values in the last 2 columns.
>
> For that you'd want RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
> FOLLOWING.

Tom's quote is pretty much an abbreviation of the following taken directly
from:
http://www.postgresql.org/docs/9.3/interactive/sql-select.html

"The default framing option is RANGE UNBOUNDED PRECEDING, which is the same
as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this
sets the frame to be all rows from the partition start up through the
current row's last peer. Without ORDER BY, all rows of the partition are
included in the window frame, since all rows become peers of the current
row."

A more useful description and example is found in the tutorial (about
halfway down, the sum(*) example):

http://www.postgresql.org/docs/9.3/interactive/tutorial-window.html

The term "peer" in the first quotation is confusing to me.  My understanding
is that "PARTITION BY" defines which rows are "peers" - even if that isn't
the wording used.  So now using "peers" in relation to the FRAME clause
confuses the issue.  IMO the "since all rows..." is really superfluous - I
would understand:

"In the presence of an ORDER BY only rows up to the current row (including
all [partition] peers) are considered.  Omitting an ORDER BY causes the
entire PARTITION to be considered the FRAME.  Other frame definitions can be
requested by supplying an explicit FRAME clause."

I knew the answer but in the interest of looking through the user's eyes I
wanted to go take a look (again, done this before) and really am not
surprised that these kinds of questions are being asked.  Just having to
scan through 5 sections of the documentation is difficult.  That said, the
tutorial section and inline examples make it quite clear how the
with/without ORDER BY behavior of window functions (sum in the example)
cause different results.

At minimum the top of 9.3.4 could provide links to, and
descriptions/summaries of, what the other 4 sections cover and why things
are broken out the way they are.  The other cross-references could point
back to that section-subsection as a kind of launch point: "Please see
section 3.5.1 for an overview of, and links to, other related sections."

Just some food for thought if anyone is industrious and annoyed enough to
act on it.

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/BUG-10256-COUNT-behaves-sort-of-like-RANK-when-used-over-a-window-containing-an-ORDER-BY-tp5803073p5803103.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.
David G Johnston <david.g.johnston@gmail.com> writes:
> The term "peer" in the first quotation is confusing to me.  My understanding
> is that "PARTITION BY" defines which rows are "peers" - even if that isn't
> the wording used.  So now using "peers" in relation to the FRAME clause
> confuses the issue.

AFAIK we've only used "peers" in this context to mean "rows with equal
sort-column values".  I don't think we have a specific term for "rows
appearing in the same partition", but certainly neither the docs nor the
code mean that when they say "peer".

[ looks at SQL standard... ]  The standard uses "peer" in this way too,
so that's where we got the term from.  Because of that, I'm unwilling
to adopt your suggestion of thinking that "peer" means "member of the
same partition".  However, it seems like maybe we need to clarify the
term some more, eg define what we mean by it in more places.  Are there
any specific places that you think this should be done?

> At minimum the top of 9.3.4 could provide links to, and
> descriptions/summaries of, what the other 4 sections cover and why things
> are broken out the way they are.  The other cross-references could point
> back to that section-subsection as a kind of launch point: "Please see
> section 3.5.1 for an overview of, and links to, other related sections."

No particular objection to doing something like that.

> Just some food for thought if anyone is industrious and annoyed enough to
> act on it.

Not me, at least not in the near future.

            regards, tom lane
On Wed, May 7, 2014 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> David G Johnston <david.g.johnston@gmail.com> writes:
> > The term "peer" in the first quotation is confusing to me.  My
> understanding
> > is that "PARTITION BY" defines which rows are "peers" - even if that
> isn't
> > the wording used.  So now using "peers" in relation to the FRAME clause
> > confuses the issue.
>
> AFAIK we've only used "peers" in this context to mean "rows with equal
> sort-column values".  I don't think we have a specific term for "rows
> appearing in the same partition", but certainly neither the docs nor the
> code mean that when they say "peer".
>
> [ looks at SQL standard... ]  The standard uses "peer" in this way too,
> so that's where we got the term from.  Because of that, I'm unwilling
> to adopt your suggestion of thinking that "peer" means "member of the
> same partition".  However, it seems like maybe we need to clarify the
> term some more, eg define what we mean by it in more places.  Are there
> any specific places that you think this should be done?
>
>
I'm pretty sure I simply considered table rows falling into the same
partition as "peers" without anything actually saying that - and for lack
of a better term.  So when I read the quoted paragraph in took some mental
processing to put it in context.

I guess rows falling into the same partition could be deemed "member" rows;
as in having membership in the partition.  Then two members are of equal
stature (rank) UNLESS other attributes, defined in the ORDER BY, cause them
to be ranked, in which case members having equal ranking according to the
ORDER BY (i.e. identical values for all specified columns) are considered
peers.  The default FRAME clause then only considers members of equal and
lesser rank relative to the current member.=E2=80=8B  You can provide an ex=
plicit
FRAME clause to look both ahead and behind the current member's ranking
group.

Would need to see where to stick something like that but at least we would
have two terms: member and peer matching with PARTITION and ORDER
respectively.

Does the standard provide a word for tuples that get placed into the same
partition?

David J.
David Johnston <david.g.johnston@gmail.com> writes:
> On Wed, May 7, 2014 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [ looks at SQL standard... ]  The standard uses "peer" in this way too,
>> so that's where we got the term from.  Because of that, I'm unwilling
>> to adopt your suggestion of thinking that "peer" means "member of the
>> same partition".

> I guess rows falling into the same partition could be deemed "member" rows;
> as in having membership in the partition.

Works for me.

> Does the standard provide a word for tuples that get placed into the same
> partition?

Not that I noticed, but I didn't search hard.

The index of SQL:2011 has one entry for "peer", pointing to this
definition under 10.10 <sort specification list>:

i) Two rows that are not distinct with respect to the <sort
   specification>s are said to be peers of each other. The relative
   ordering of peers is implementation-dependent.

so in their usage it's not even specific to windows.  The terminology
for windows seems to be mostly defined in 4.15.14, and I don't see
a term in there for the rows belonging to a partition.

            regards, tom lane

Re: Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

От
Bruce Momjian
Дата:
On Wed, May  7, 2014 at 04:08:23PM -0700, David G Johnston wrote:
> Tom's quote is pretty much an abbreviation of the following taken directly
> from:
> http://www.postgresql.org/docs/9.3/interactive/sql-select.html
>
> "The default framing option is RANGE UNBOUNDED PRECEDING, which is the same
> as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this
> sets the frame to be all rows from the partition start up through the
> current row's last peer. Without ORDER BY, all rows of the partition are
> included in the window frame, since all rows become peers of the current
> row."
>
> A more useful description and example is found in the tutorial (about
> halfway down, the sum(*) example):
>
> http://www.postgresql.org/docs/9.3/interactive/tutorial-window.html
>
> The term "peer" in the first quotation is confusing to me.  My understanding
> is that "PARTITION BY" defines which rows are "peers" - even if that isn't
> the wording used.  So now using "peers" in relation to the FRAME clause
> confuses the issue.  IMO the "since all rows..." is really superfluous - I
> would understand:
>
> "In the presence of an ORDER BY only rows up to the current row (including
> all [partition] peers) are considered.  Omitting an ORDER BY causes the
> entire PARTITION to be considered the FRAME.  Other frame definitions can be
> requested by supplying an explicit FRAME clause."
>
> I knew the answer but in the interest of looking through the user's eyes I
> wanted to go take a look (again, done this before) and really am not
> surprised that these kinds of questions are being asked.  Just having to
> scan through 5 sections of the documentation is difficult.  That said, the
> tutorial section and inline examples make it quite clear how the
> with/without ORDER BY behavior of window functions (sum in the example)
> cause different results.
>
> At minimum the top of 9.3.4 could provide links to, and
> descriptions/summaries of, what the other 4 sections cover and why things
> are broken out the way they are.  The other cross-references could point
> back to that section-subsection as a kind of launch point: "Please see
> section 3.5.1 for an overview of, and links to, other related sections."

Based on this discussion, I developed the attached patch which qualifies
"peer" in every place it is mentioned --- this should reduce future
confusion.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Вложения

Re: Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

От
Bruce Momjian
Дата:
On Wed, Sep  3, 2014 at 06:14:57PM -0400, Bruce Momjian wrote:
> > At minimum the top of 9.3.4 could provide links to, and
> > descriptions/summaries of, what the other 4 sections cover and why things
> > are broken out the way they are.  The other cross-references could point
> > back to that section-subsection as a kind of launch point: "Please see
> > section 3.5.1 for an overview of, and links to, other related sections."
>
> Based on this discussion, I developed the attached patch which qualifies
> "peer" in every place it is mentioned --- this should reduce future
> confusion.

Patch applied and backpatched to 9.4.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +