Re: insensitive collations

Поиск
Список
Период
Сортировка
От Peter Eisentraut
Тема Re: insensitive collations
Дата
Msg-id b6b28f54-a667-87c2-870f-c9f07dd796ee@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: insensitive collations  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: insensitive collations  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Список pgsql-hackers
On 2019-02-21 03:17, Peter Geoghegan wrote:
> I wonder if it would be better to break this into distinct commits?

I thought about that.  Especially the planner/executor changes could be
done separately, sort of as a way to address the thread
"ExecBuildGroupingEqual versus collations".  But I'm not sure if they
would have good test coverage on their own.  I can work on this if
people think this would be useful.

> * Why is support for non-deterministic comparisons limited to the ICU
> provider? If that's the only case that could possibly be affected,
> then why did we ever add the varstrcmp() tie-breaker call to strcmp()
> in the first place? The behavior described in the commit message of
> bugfix commit 656beff5 describes a case where Hungarian text caused
> index corruption by being strcoll()-wise equal but not bitwise equal.
> Besides, I think that you can vendor your own case insensitive
> collation with glibc, since it's based on UCA.

The original test case (described here:
https://www.postgresql.org/message-id/27064.1134753128%40sss.pgh.pa.us)
no longer works, so it was probably fixed on the glibc side.  The git
log of the hu_HU locale definition shows that it has been "fixed" in
major ways several times over the years, so that's plausible.

I tried reproducing some more practical scenarios involving combining
diacritics, but glibc apparently doesn't believe strings in different
normal forms are equal.  At that point I gave up because this doesn't
seem worthwhile to support.

Moreover, I think allowing this would require a "trusted" strxfrm(),
which is currently disabled.

>> +   return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
>> +                                         DEFAULT_COLLATION_OID,
>> +                                         oldvalue, newvalue));
>>  }
> 
> The existing restriction on ICU collations (that they cannot be
> database-wide) side-steps the issue.
> 
> * Can said restriction somehow be lifted? That seems like it would be
> a lot cleaner.

Lift what restriction?  That ICU collations cannot be database-wide?
Sure that would be nice, but it's a separate project.  Even then, I'm
not sure that we would allow a database-wide collation to be
nondeterministic.  That would for example disallow the use of LIKE,
which would be weird.  In any case, the above issue can be addressed
then.  I think it's not worth complicating this right now.

> * Why have you disable this optimization?:
> 
>>     /* Fast pre-check for equality, as discussed in varstr_cmp() */
>> -   if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>> +   if ((!sss->locale || sss->locale->deterministic) &&
>> +       len1 == len2 && memcmp(a1p, a2p, len1) == 0)
> 
> I don't see why this is necessary. A non-deterministic collation
> cannot indicate that bitwise identical strings are non-equal.

Right, I went too far there.

> * Perhaps you should add a "Tip" referencing the feature to the
> contrib/citext documentation.

Good idea.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Jamison, Kirk"
Дата:
Сообщение: RE: Timeout parameters
Следующее
От: mitani
Дата:
Сообщение: [PATCH] Allow transaction to partition table using FDW