Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)

From Peter Geoghegan
Subject Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Date
Msg-id CAM3SWZTmvgnUZBNLst0Sv3bwwP+9uVF56ASbJpTmwc97_pFRvA@mail.gmail.com
In reply to Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-bugs
On Mon, Mar 21, 2016 at 9:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> On RHEL6, I get
>
> ./strxfrm-binary de_DE.UTF-8 'eai' 'e aí'
> "eai" -> 100c140108080801020202 (11 bytes)
> "e aí" -> 100c140108080901020202010235 (14 bytes)

As expected, ISTM that the "primary weights" here are the same.

Aligned comparison of this with correct en_US.UTF-8 blobs from my system:

Buggy version (Tom's de_DE.UTF-8 testcase):

"eai"  -> 100c14 01 090909 01 090909 (11 bytes)
"e aí" -> 100c14 01 0b0909 01 090909010235 (14 bytes)

Correct version (though uses different locale):

"eai"  -> 100c14 01 080808 01 020202 (11 bytes)
"e aí" -> 100c14 01 080809 01 020202010235 (14 bytes)

The low bytes, 0x01, separate the weight levels. I think that this
always happens with glibc. The space character is only represented at
the last level, which is why strcoll() typically weighs spaces as very
unimportant (you'll recall that we hear complaints about this from
time to time).
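
To make the level structure easier to see, here is a minimal Python sketch
that splits a blob at each 0x01 byte. It assumes, per my guess above, that
0x01 only ever appears as a level separator in glibc's output; the helper
name split_levels is made up for illustration. The two hex strings are the
ones from Tom's RHEL6 output.

```python
def split_levels(blob_hex):
    """Split a hex-encoded strxfrm() blob into weight levels.

    Assumption: a 0x01 byte is always a level separator and never
    occurs inside a level.
    """
    blob = bytes.fromhex(blob_hex)
    return [level.hex() for level in blob.split(b"\x01")]


# Blobs copied from Tom's RHEL6 de_DE.UTF-8 output.
plain = split_levels("100c140108080801020202")
spaced = split_levels("100c140108080901020202010235")

for name, levels in (("eai", plain), ("e aí", spaced)):
    print(name, levels)
```

Split this way, the first level is identical for both strings, and only the
string containing a space gets a fourth level.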

My guess is that the 0x0b byte in Tom's buggy de_DE.UTF-8 testcase is
the problem. Not sure why.

I guess I'll look around here for further ideas tomorrow:
http://unicode.org/reports/tr10/#Well_Formedness_Examples

> This seems a bit problematic, because these string sort in the other
> order ("e aí" before "eai") according to sort(1) as well as Postgres
> sorting code.
>
> It's possible I've copied-and-pasted these multibyte characters wrong.
> But if I haven't, this says that the strxfrm-based optimization is
> unusably broken on a very large fraction of reasonably-modern
> installations.  Quite aside from casting aspersions on the glibc guys,
> how did we fail to notice this in our own testing?

Because we don't test every possible libc installation. And even if
we did, why should we be able to usefully nail down something that's
fundamentally not under our control? (I don't want to assume that that
bug is at fault, but it seems like a reasonable speculation,
especially based on your "strxfrm-binary" result.)

Let's not relitigate the debate about Postgres controlling its own
collations right now, though.

I think that amcheck will be able to provide reasonable smoke-testing
for these kinds of issues once it gets some buildfarm cycles. I intend
to write plenty of tests for external sorting to go with amcheck, too;
that code currently has no tests whatsoever. amcheck provides a nice
way of testing if strxfrm() agrees with strcoll(), without having to
"expect" any particular total ordering for a collatable type, which is
what a simple pg_regress approach would require. Portable testing of
strcoll() + strxfrm() will improve matters.
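
To illustrate the kind of check I mean, here is a rough standalone sketch in
Python. It is not amcheck and not Postgres code; it only uses the standard
library's locale.strcoll() and locale.strxfrm(). The function name
find_disagreements is made up, and the locale you pass must be installed on
the machine. The point is that it needs no expected ordering: it only asks
whether the two functions agree on the sign of each pairwise comparison.

```python
import locale
from itertools import combinations


def sign(n):
    """Reduce a comparison result to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def find_disagreements(strings, locale_name):
    """Return pairs where strcoll() and strxfrm() order two strings differently.

    Raises locale.Error if locale_name is not installed.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        bad = []
        for a, b in combinations(strings, 2):
            by_coll = sign(locale.strcoll(a, b))
            key_a, key_b = locale.strxfrm(a), locale.strxfrm(b)
            by_xfrm = (key_a > key_b) - (key_a < key_b)
            if by_coll != by_xfrm:
                bad.append((a, b))
        return bad
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore process state


# The "C" locale is always available, so this runs anywhere.
print(find_disagreements(["eai", "e ai", "abc"], "C"))
```

Running it with a locale such as de_DE.UTF-8 and the strings from Tom's
testcase would be the interesting case; any pair it returns is a pair where
the strxfrm() key order contradicts strcoll().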

-- 
Peter Geoghegan
