Re: PoC: Partial sort

From: Alexander Korotkov
Subject: Re: PoC: Partial sort
Msg-id: CAPpHfdsMH9LyMRVaVoDqpw2auwXwSFt49QcciRO_p=vtboy2Ng@mail.gmail.com
In reply to: Re: PoC: Partial sort  (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
On Sun, Sep 14, 2014 at 7:39 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Actually, higher cardinality skip columns is better. Sorting smaller
> groups is faster than sorting larger groups of the same total size. Also,
> with smaller groups you reach the limit more accurately (on average),
> i.e. you sort a smaller number of rows in total.

Higher cardinality leading key columns are better for what? Do you
mean they're better in that those cases are more sympathetic to your
patch, or better in the general sense that they'll perform better for
the user? The first example query, that you chose to demonstrate your
patch had a leading, indexed column (column "v1") with much lower
cardinality/n_distinct than the column that had to be sorted on
(column "v2"). That was what my comments were based on.

When this feature is used, there will often be a significantly lower
cardinality in the leading, indexed column (as in your example).
Otherwise, the user might well have been happy to just order on the
indexed/leading column alone. That isn't definitely true, but it's
often true.

I just meant that higher cardinality makes the partial sort itself cheaper. We may have misunderstood each other because the notions "high" and "low" are relative. When the cardinality is 1, partial sort seems to be useless. When the cardinality is very low, it can still be cheaper to do a sequential scan + full sort rather than an index scan + partial sort. When the cardinality is close to the number of rows then, as you mentioned, the user probably doesn't need to sort by the rest of the columns; at least one exception is when the user needs to force a unique ordering. So we need to target something in between these two corner cases.
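To illustrate the tradeoff, here is a toy sketch (not the executor code, just an illustration) of how partial sort works: rows arrive already ordered by the skip column, each group is sorted by the trailing column, and the scan stops once the LIMIT is satisfied. Smaller groups mean fewer rows sorted past the limit:

```python
from itertools import groupby

def partial_sort(rows, limit):
    """Toy partial sort: rows are pre-ordered by the leading (skip)
    column, r[0]; sort each group by the trailing column, r[1], and
    stop as soon as LIMIT rows have been produced."""
    out = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        out.extend(sorted(group, key=lambda r: r[1]))
        if len(out) >= limit:
            break          # later groups are never even sorted
    return out[:limit]

# Rows ordered by the first column, unsorted on the second:
rows = [(1, 3), (1, 1), (2, 9), (2, 0), (3, 5)]
print(partial_sort(rows, 3))  # -> [(1, 1), (1, 3), (2, 0)]
```

With higher cardinality (smaller groups), each `sorted()` call is cheaper and the loop exits closer to exactly LIMIT rows.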
 
>> I'm not sure if that's worth it to more or less
>> duplicate heap_tuple_attr_equals() to save a "mere" n expensive
>> comparisons, but it's something to think about (actually, there are
>> probably less than even n comparisons in practice because there'll be
>> a limit).
>
> Not correct. Smaller groups are not OK. Imagine that two representations
> of the same skip column value exist. The index may return them in any
> order, even alternating one by one. In that case sorting on the other
> column never takes place, while it should.

I think I explained this badly - it wouldn't be okay to make the
grouping smaller, but as you say we could tie-break with a proper
B-Tree support function 1 comparison to make sure we really had
reached the end of our grouping.
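A minimal sketch of that tie-break (hypothetical names, not PostgreSQL code): try a cheap binary equality check first, and only on a mismatch consult the authoritative comparator, the analogue of B-Tree support function 1, to decide whether the group really ended:

```python
def same_group(a, b, full_cmp):
    """Decide whether two skip-column values belong to the same group.
    Cheap binary equality runs first (like memcmp()); only on a
    mismatch do we fall back to the full comparator, so two distinct
    representations of the same value still land in one group."""
    if a == b:                   # cheap check, nearly free
        return True
    return full_cmp(a, b) == 0   # authoritative tie-break

# Example: a case-insensitive "collation" under which 'Foo' and 'foo'
# differ binarily but compare as equal.
ci_cmp = lambda a, b: (a.lower() > b.lower()) - (a.lower() < b.lower())
print(same_group("Foo", "foo", ci_cmp))  # -> True (same group)
print(same_group("Foo", "bar", ci_cmp))  # -> False (group boundary)
```

Without the tie-break, "Foo" and "foo" returned alternately by the index would each start a singleton group and the trailing-column sort would silently never happen.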

FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
first, so the use of the equality operator with text in mind that you
mention may soon not be useful at all. The evidence suggests that
memcmp() is so much cheaper than special preparatory NUL-termination +
strcoll() that we should always try it first when sorting text, even
when we have no good reason to think it will work. The memcmp() is
virtually free. And so, you see why it might be worth thinking about
this when we already have reasonable confidence that many comparisons
will indicate that datums are equal. Other datatypes will have
expensive "real" comparators, not just text or numeric.
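The memcmp()-first idea can be sketched like this (an illustration, not varstr_cmp(); in the C locale Python's `locale.strcoll` behaves like `strcmp`):

```python
import functools
import locale

def text_cmp(a: bytes, b: bytes) -> int:
    """Opportunistic comparison: try cheap binary equality first;
    only fall back to the expensive locale-aware comparison when the
    bytes actually differ."""
    if a == b:          # memcmp()-style check, virtually free
        return 0
    # Analogue of NUL-termination + strcoll(): locale-aware compare
    return locale.strcoll(a.decode(), b.decode())

words = [b"pear", b"apple", b"apple", b"banana"]
print(sorted(words, key=functools.cmp_to_key(text_cmp)))
# -> [b'apple', b'apple', b'banana', b'pear']
```

Duplicated keys, exactly the situation a partial sort's skip columns create, hit the cheap branch every time, which is why this discussion is relevant here.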

When strings are not equal, bttextcmp() still needs to call strcoll(), while texteq() doesn't. So the equality operator would still have an advantage over the comparison function: the equality operator doesn't have to rank unequal values. However, the real cost of this advantage depends on the cardinality of the presorted columns as well.
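The texteq() shortcut in question is simply that equality can answer without any locale work at all, e.g. (a sketch, not PostgreSQL's actual code):

```python
def text_eq(a: bytes, b: bytes) -> bool:
    """Equality never needs collation: a length mismatch or any
    differing byte means "not equal", full stop. A comparator, by
    contrast, must still rank the two unequal strings."""
    return len(a) == len(b) and a == b

print(text_eq(b"abc", b"abd"))  # -> False, no strcoll() needed
print(text_eq(b"abc", b"abc"))  # -> True
```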

------
With best regards,
Alexander Korotkov.  
