Re: [HACKERS] distinct + order by

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: [HACKERS] distinct + order by
Дата
Msg-id 19087.910541797@sss.pgh.pa.us
обсуждение исходный текст
Ответ на distinct + order by  (t-ishii@sra.co.jp (Tatsuo Ishii))
Ответы Re: [HACKERS] distinct + order by
Список pgsql-hackers
t-ishii@sra.co.jp (Tatsuo Ishii) writes:
> 6.4 allows to use ORDER BY with a column that does not appear 
> in a target list. This is great. However, following result seems
> strange to me. Sometimes DISTINCT does not remove duplication.
> Is this normal behavior?

I think there's something funny going on, but it's not clear that
this kind of query is sensible anyway.

It's easy to see what's happening.  DISTINCT relies on sorting by the
"distinct" column first, so that a pass of a uniq(1)-like duplicate
detector can get rid of the duplicates:

tree=> explain select distinct i from t;
NOTICE:  QUERY PLAN:

Unique  (cost=545.65 size=0 width=0) ->  Sort  (cost=545.65 size=0 width=0)       ->  Seq Scan on t  (cost=545.65
size=11262width=4)
 

EXPLAIN

Now, if we add an ORDER BY:

tree=> explain select distinct i from t order by j;
NOTICE:  QUERY PLAN:

Unique  (cost=545.65 size=0 width=0) ->  Sort  (cost=545.65 size=0 width=0)       ->  Seq Scan on t  (cost=545.65
size=11262width=12)
 

EXPLAIN

Oops.  The sort is evidently being done on the order-by column,
thus there's no guarantee that the duplicate i values will be
brought together.

Still, I'd expect the Unique pass to get rid of duplicates that did
manage to be adjacent.  Your example shows that that's not happening.
So Unique has some unexpected dependency on how the prior sort pass
is done.  (Maybe it assumes that the sort pass is on *exactly* the
columns that were requested to be distinct, and re-uses the sort's
comparison results somehow?)

It seems clear that the query tree builder and the Unique executor
have different assumptions about how this is supposed to work.  That
probably will yield bugs, so it needs to be looked at.

But I'm not convinced that the example you give is a sensible query
at all.  If you have done a SELECT DISTINCT on column A, what does it
mean to specify that the results are sorted by column B?  A single row
of the result might represent many table rows with different values of
column B --- how should the system know which of those B values to use
for sorting the result?

My guess is that if you look into the SQL standard, this combination
is forbidden.  Thus the *real* bug is that Postgres isn't rejecting it.

I'd still recommend that someone take a close look at how this example
is behaving now, because even if it should be kicked out as a user
error, there might be related legal cases that are also having trouble.

If we did want to make this example behave in a rational way, then
probably the right implementation is something like
* sort by i,j* distinct-filter on i only, being careful to keep first row    in each set of duplicates* sort by j

This would ensure that the final sort by j uses, for each distinct i,
the lowest of the j-values associated with that i.  This is a totally
arbitrary decision, but at least it will give reproducible results.
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Terry Mackintosh
Дата:
Сообщение: regression tests
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] regression tests