Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)

From: Peter Geoghegan
Subject: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Msg-id: CAM3SWZR+3DFafNyiJX8daJfLLdANksbg2TDEMe4qUV7VFuc0Ng@mail.gmail.com
In response to: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
On Tue, Mar 22, 2016 at 7:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> My concern was not merely "academic" (i.e. it was not limited in scope
>> to things that don't make B-Tree indexes corrupt). Pretty sure that we
>> need to start thinking of this as a problem with strcoll() that
>> strxfrm() does not have for more fundamental reasons, because
>> strcoll() says that the first string in the de_DE sorted list is
>> *greater* than the third string.
>
> [ squint... ]  I was looking specifically for that sort of misbehavior
> in my test program, and I haven't seen it.

Sorry, I was in too much of a hurry to get to the bottom of this with
that example. I failed to notice that LC_COLLATE for sort was "de_DE",
not "de_DE.UTF-8". For my simple case it would not have mattered if
"de_DE" was specified instead of "de_DE.UTF-8" on a non-broken system.
But this was a broken system.

Anyway, what prompted the misguided example was this:

[vagrant@localhost ~]$ ./strxfrm-binary de_DE.UTF-8 'x xx"' 'xxx"'
"x xx"" -> 2323230108080801020202010235034b (16 bytes)
"xxx"" -> 232323010808080102020201044b (14 bytes)
strcmp(arg1, arg2) result: -2
strcoll(arg1, arg2) result: -6
[vagrant@localhost ~]$ ./strxfrm-binary de_DE.UTF-8 'x xxf' 'xxxf'
"x xxf" -> 2323231101080808080102020202010235 (17 bytes)
"xxxf" -> 2323231101080808080102020202 (14 bytes)
strcmp(arg1, arg2) result: 1
strcoll(arg1, arg2) result: -6

Notice that in the case where a double quote is used, strxfrm() and
strcoll() agree, whereas if that character is a letter from the
Latin alphabet instead, they disagree. (The strcmp() line compares the
two strxfrm() blobs, so its sign is the strxfrm() ordering.)
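
For anyone who wants to repeat this kind of check without the
strxfrm-binary tool, here is a minimal sketch in Python, which wraps the
C library's collation functions through its locale module. The helper
names (sign, compare_both_ways, agrees) are made up for illustration and
are not from the tool above. It assumes a platform where comparing the
transformed strings by code point matches wcscmp(), as on Linux, and it
falls back to the "C" locale if de_DE.UTF-8 is not installed.

```python
import locale


def sign(n):
    """Collapse any integer to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def compare_both_ways(a, b):
    """Return (strcoll sign, strxfrm sign) under the current LC_COLLATE.

    The second value comes from comparing the transformed keys directly,
    which is what a binary comparison of the strxfrm() output amounts to.
    """
    by_strcoll = sign(locale.strcoll(a, b))
    key_a, key_b = locale.strxfrm(a), locale.strxfrm(b)
    by_strxfrm = (key_a > key_b) - (key_a < key_b)
    return by_strcoll, by_strxfrm


def agrees(a, b):
    """True when both comparison routes order the pair the same way."""
    by_strcoll, by_strxfrm = compare_both_ways(a, b)
    return by_strcoll == by_strxfrm


if __name__ == "__main__":
    try:
        locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    except locale.Error:
        # Assumption: the German locale may be missing; "C" always exists.
        locale.setlocale(locale.LC_COLLATE, "C")

    # The two pairs from the transcript above.
    for a, b in [('x xx"', 'xxx"'), ("x xxf", "xxxf")]:
        print(repr(a), repr(b), compare_both_ways(a, b), agrees(a, b))
```

On a correctly working libc both routes should give the same sign for
every pair; a pair for which agrees() returns False would be the kind of
disagreement shown in the second transcript.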

My intuition is that this is significant from the point of view of
fixing the glibc strcoll() bug. It feels like there is an incorrectly
applied optimization here, that occurs for strcoll() but not the
separate transformation process that strxfrm() does.

There seem to have been at least a few instances of over-optimizing
strcoll() in the past few years. For example:
https://github.com/bminor/glibc/commit/87701a58e291bd7ac3b407d10a829dac52c9c16e

This bug looks like a possible candidate, given that complaints were
about de_DE:

https://github.com/bminor/glibc/commit/33a667def79c42e0befed1a4070798c58488170f

Is this bug of the right vintage? Seems like it might be a bit too
early for RHEL 6 to be affected, but I'm no expert.

--
Peter Geoghegan
