Re: Duplicate Values or Not?!

From: Greg Stark
Subject: Re: Duplicate Values or Not?!
Message-ID: 87fys3r8vf.fsf@stark.xeocode.com
In reply to: Re: Duplicate Values or Not?!  (Greg Stark <gsstark@mit.edu>)
Responses: Re: Duplicate Values or Not?!  (Martijn van Oosterhout <kleptog@svana.org>)
List: pgsql-general

Greg Stark <gsstark@MIT.EDU> writes:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > If that does change the results, it indicates you've got strings which
> > are bytewise different but compare equal according to strcoll().  We've
> > seen this and other misbehaviors from some locale definitions when faced
> > with data that is invalid per the encoding the locale expects.
>
> There are plenty of non-bytewise-identical strings that do legitimately
> compare equal in various locales. Does the hash code hash strxfrm or the
> original bytes?

Hm. Some experimentation shows that, at least with glibc's locale definitions,
the strings that I thought compared equal don't actually compare equal.
Capitalization, punctuation, and white space, while basically ignored in
general in non-C locales, do seem to compare non-equal when they're the only
differentiating factor.

Is this guaranteed by any spec? Or is counting on this behaviour unsafe?

If it's legal for strcoll to compare as equal two byte-wise different strings
then the hash function really ought to be calling strxfrm before hashing or
else it will be inconsistent. It doesn't seem to be doing so currently.
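
To make the consistency requirement concrete, here is a minimal Python sketch
(not Postgres code). The names collation_hash and toy_xfrm are hypothetical,
and toy_xfrm is only a stand-in for a locale whose collation treats strings
differing in case as equal; real locales may well not behave that way.

```python
import locale


def collation_hash(s, xfrm=locale.strxfrm):
    """Hash the collation key of s instead of its raw bytes.

    Two strings that the collation treats as equal produce the same key,
    so they also produce the same hash value.
    """
    return hash(xfrm(s))


def toy_xfrm(s):
    # Assumption for illustration only: a collation that ignores case.
    return s.casefold()


a, b = "Stark", "stark"

# The raw bytes differ, so a hash over the bytes would normally differ too.
print(a.encode("utf-8") == b.encode("utf-8"))

# The transformed keys match, so the collation-aware hashes match as well.
print(collation_hash(a, toy_xfrm) == collation_hash(b, toy_xfrm))
```

The cost is one transform call per hashed string, which is the overhead
weighed against a tie-breaking comparison later in this message.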

I find it interesting that Perl has faced this same dilemma and chose to
override the locale definition in this case. If the locale definition
compares two strings as equal, Perl does a bytewise comparison and uses
that to break ties. This guarantees that non-bytewise-identical strings never
compare equal. I suspect they did it for a similar reason too, namely keeping
the semantics in sync with Perl hashes.

Postgres could follow that model. I think it would solve any inconsistencies
just fine and not cause problems. However, it would be visible to users, which
may be considered a bug if the locale really does claim the strings are equal
but Postgres doesn't agree. On the other hand, I think it would perform better
than a lot of extra calls to strxfrm, since the extra memcmp would only rarely
kick in.
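
The tie-breaking comparison can be sketched roughly as follows, again in
Python rather than in Postgres's C. The names tiebreak_compare and toy_coll
are hypothetical, and toy_coll merely simulates a locale that reports strings
differing only in case as equal.

```python
import locale
from functools import cmp_to_key


def tiebreak_compare(a, b, coll=locale.strcoll):
    """Compare by collation first; on a tie, fall back to the raw bytes."""
    r = coll(a, b)
    if r != 0:
        return r
    # Only reached when the collation reports equality, so the extra
    # bytewise comparison is rarely executed.
    ab, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ab > bb) - (ab < bb)


def toy_coll(a, b):
    # Assumption for illustration only: a case-insensitive collation.
    x, y = a.casefold(), b.casefold()
    return (x > y) - (x < y)


words = ["stark", "Stark", "abc"]
print(toy_coll("Stark", "stark"))          # 0: the toy collation sees a tie
print(tiebreak_compare("Stark", "stark", toy_coll))  # -1: bytes break the tie
print(sorted(words, key=cmp_to_key(lambda a, b: tiebreak_compare(a, b, toy_coll))))
```

With this comparison, equality implies bytewise identity, so a hash over the
raw bytes stays consistent with it and no transform call is needed.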

--
greg
