Re: sortsupport for text

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема Re: sortsupport for text
Дата
Msg-id CAEYLb_UenB9wbztuv9sUJsOA_ohojuwiFixR1P8GJuQMsYT9tQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: sortsupport for text  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: sortsupport for text  (Florian Pflug <fgp@phlo.org>)
Список pgsql-hackers
On 20 June 2012 17:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <peter@2ndquadrant.com> writes:
>> No, I'm suggesting it would probably be at least a bit of a win here
>> to cache the constant, and only have to do a strxfrm() + strcmp() per
>> comparison.
>
> Um, have you got any hard evidence to support that notion?  The
> traditional advice is that strcoll is faster than using strxfrm unless
> the same strings are to be compared repeatedly.  I'm not convinced that
> saving the strxfrm on just one side will swing the balance.

Fair enough,  but I'd suggest that that traditional advice assumes
that strcoll()'s space-efficiency and general avoidance of dynamic
allocation is an important factor, which, for us, it clearly isn't,
since we can re-use a single buffer in the manner of Robert's text
sortsupport patch for each and every per-tuple strxfrm() blob (when
traversing an index). This is a direct quote from glibc's strcoll_l.c
:
    Perform the first pass over the string and while doing this find    and store the weights for each character.
Sincewe want this to    be as fast as possible we are using `alloca' to store the temporary    values.  But since there
isno limit on the length of the string    we have to use `malloc' if the string is too long.  We should be    very
conservativehere. 

Here, alloca is used to allocate space in a stack frame. I believe
that this is an entirely inappropriate trade-off for Postgres to be
making. strxfrm(), in constrast, leaves buffer sizing and management
up to the caller. That has to be a big part of the problem here.

>> The fact is that this is likely to be a fairly significant
>> performance win, because strxfrm() is quite simply the way you're
>> supposed to do collation-aware sorting, and is documented as such. For
>> that reason, C standard library implementations should not be expected
>> to emphasize its performance - they assume that you're using strxfrm()
>> + their highly optimised strcmp()
>
> Have you got any evidence in support of this claim, or is it just
> wishful thinking about what's likely to be inside libc?

According to the single-unix specification's strcoll() documentation,
"The strxfrm() and strcmp() functions should be used for sorting large
lists". If that isn't convincing enough for you, there is the fact
that glibc's strcmp() is clearly highly optimised for each and every
architecture, and that we are currently throwing away an extra
strcmp() in the event of strcoll() equality.

> I'd also note that any comparisons you may have seen about this are certainly not
> accounting for the effects of data bloat from strxfrm (ie, possible
> spill to disk, more merge passes, etc).

What about the fact that strcoll() may be repeatedly allocating and
freeing memory per comparison? The blobs really aren't that much
larger than the strings to be sorted, which are typically quite short.

> In any case, if you have to redefine the meaning of equality in order
> to justify a performance patch, I'm prepared to walk away at the start.

The advantage of my proposed implementation is precisely that I won't
have to redefine the meaning of equality, and that only the text
datatype will have to care about equivalency, so you can just skip
over an explanation of equivalency for most audiences.

If you feel that strongly about it, and I have no possible hope of
getting this accepted, I'm glad that I know now rather than after
completing a significant amount of work on this. I would like to hear
other people's opinions before I drop it though.

> The range of likely performance costs/benefits across different locales
> and different implementations is so wide that if you can't show it to be
> a win even with the strcmp tiebreaker, it's not likely to be a reliable
> win without that.

glibc is the implementation that really matters. My test-case used
en_US.UTF-8 as its locale, which has to be one of the least stressful
to strcoll() - I probably could have shown a larger improvement just
by selecting a locale that was known to have to make more passes,
like, say, hu_HU.UTF-8.

I expect to have access to a 16 core server next week, which Bull have
made available to me. Maybe I'll get some interesting performance
numbers from it.

The reason that I don't want to use the blob with original string hack
is because it's ugly, space-inefficient, unnecessary and objectively
incorrect, since it forces us to violate the conformance requirement
C9 of Unicode 3.0, marginal though that may be.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Marc Mamin"
Дата:
Сообщение: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Следующее
От: Josh Berkus
Дата:
Сообщение: Re: Nasty, propagating POLA violation in COPY CSV HEADER