Re: sortsupport for text

Поиск
Список
Период
Сортировка
От Florian Pflug
Тема Re: sortsupport for text
Дата
Msg-id EC561C64-6A13-4FE6-87E1-B13B347A9DF6@phlo.org
обсуждение исходный текст
Ответ на Re: sortsupport for text  (Peter Geoghegan <peter@2ndquadrant.com>)
Ответы Re: sortsupport for text  (Peter Geoghegan <peter@2ndquadrant.com>)
Список pgsql-hackers
On Jun21, 2012, at 11:55 , Peter Geoghegan wrote:
> On 21 June 2012 10:24, Florian Pflug <fgp@phlo.org> wrote:
>> On Jun21, 2012, at 02:22 , Peter Geoghegan wrote:
>>> I've written a very small C++ program, which I've attached, that
>>> basically proves that this can still make a fairly large difference -
>>> I hope it's okay that that it's C++, but that allowed me to write the
>>> program quickly, with no dependencies for anyone who cares to try this
>>> out, other than a C++ compiler and the standard library.
>>
>> Uh, are you sure the results aren't swapped? string_wrapper takes a
>> parameter called use_strcoll, and it indeed uses strcoll() if that
>> parameter is true, and strxfm() otherwise. But then it passes *false*
>> to determine the speed of strcoll(), and *true* to determine the speed
>> of strxfm()...
>
> Egads, you're right. Apparently in my haste I gave the wrong boolean
> argument to the class constructor in each case. I am completely wrong
> about this. I apologise for the careless mistake. At least I advanced
> the debate, though - we don't want to do any ad-hoc generation of
> strxfrm() output within comparators, even when one side can have a
> cached value.

FWIW, I've played around a bit with a fixed version of your set_test.cpp.
I made it use the cached sort key of the RHS also, made it preserve sort
keys during copy construction, even used a boost::shared_ptr to avoid
actually copying the cached sort key. The strxfrm() version still performed
about 30% worse than the strcoll() version.

From that, I figured that tree inserts don't trigger enough comparisons to
make strxfrm() worthwhile. So I mangled set_test.cpp a bit more and instead
of a set::set, I created an (C-style) array of string_wrapper instances,
and sorted them with std::sort(). Which made strcoll() twice as fast as
strxfrm()...

At this point, my theory is that your choice of "random" strings prevents
strxfrm() from ever winning over strcoll(). The reason being that you pick
each letter uniformly distributed from a-z, resulting in a probability of
two string differing in the first of 1 - 1/26 =~ 96%. Thus, even for
extremely complex collation rules, strcoll() probably only needs to compare
a few characters to determine the order. Whereas strxfrm() has to compute
the whole sort key, no matter what.

The question is thus, how good a model are your "random" strings for the
input of a typical sorting step in postgres? My guess is, a quite good one
actually, since people probably don't deal with a lot of very similar strings
very often. Which makes we wonder if using strxfrm() during sorting wouldn't
be a net loss, all things considered…

My modified set_test.cpp (which should now be called array_test.cpp)
is attached.

best regards,
Florian Pflug

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: sortsupport for text
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: sortsupport for text