Re: Built-in CTYPE provider

From Jeff Davis
Subject Re: Built-in CTYPE provider
Date
Msg-id 7089acb3ebac0c1682a79c8bc16803cf06896fb9.camel@j-davis.com
In reply to Re: Built-in CTYPE provider  ("Daniel Verite" <daniel@manitou-mail.org>)
List pgsql-hackers
On Mon, 2024-01-15 at 15:30 +0100, Daniel Verite wrote:
> Concerning the target category_test, it produces failures with
> versions of ICU with Unicode < 15. The first one I see with Ubuntu
> 22.04 (ICU 70.1) is:

...

> I find these results interesting because they tell us what contents
> can break regexp-based check constraints on upgrades.

Thank you for collecting and consolidating this information.

> But about category_test as a pass-or-fail kind of test, it can only
> be used when the Unicode version in ICU is the same as in Postgres.

The test has a few potential purposes:

1. To see if there is some error in parsing the Unicode files and
building the arrays in the .h file. For instance, let's say the perl
parser I wrote works fine on the Unicode 15.1 data file, but does
something wrong on the 16.0 data file: the test would fail and we'd
investigate. This is the most important reason for the test.

2. To notice any quirks between how we interpret Unicode vs how ICU
does.

3. To help see "interesting" differences between different Unicode
versions.

For #1 and #2, the best way to test is by using a version of ICU that
uses the same Unicode version as Postgres. Whoever runs update-unicode
can try to build against a matching ICU version for the purposes of the
test. NB: There might be no version of ICU where the Unicode version
exactly matches what we'd like to update to. In that case, we'd need to
use the closest version and do some manual validation that the
generated tables are sane.
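
As a rough sketch of the "closest version" choice (not part of the
patch): the helper below picks, from a table of ICU releases, the one
whose Unicode version matches the target, or otherwise the newest one
that is not newer than the target. The function names, the selection
rule, and the release table are all assumptions made for illustration;
the table values should be checked against the ICU release notes.

```python
# Illustrative ICU release -> Unicode version table (assumed values;
# verify against the ICU release notes before relying on them).
EXAMPLE_ICU_RELEASES = {"70.1": "14.0", "72.1": "15.0", "74.1": "15.1"}


def parse_version(text):
    """Turn '15.1' into (15, 1, 0) so versions compare as tuples."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def closest_icu(target_unicode, icu_to_unicode):
    """Return (icu_release, unicode_version, exact_match).

    Assumed rule: prefer the newest Unicode version that is not newer
    than the target; if every release is newer, take the oldest one.
    """
    target = parse_version(target_unicode)
    candidates = [(parse_version(u), icu) for icu, u in icu_to_unicode.items()]
    not_newer = [c for c in candidates if c[0] <= target]
    unicode_version, icu = max(not_newer) if not_newer else min(candidates)
    return icu, ".".join(map(str, unicode_version)), unicode_version == target
```

When exact_match comes back false, the test result is only advisory and
the manual validation described above still applies.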

For #3, that is also interesting information to know about, but it's
not directly actionable. As you point out, Unicode does not guarantee
that these properties are static forever, so regexes can change
behavior when we update Unicode for the next PG version. That is a much
lower risk than a collation change, but it is still a risk for regexes
inside of a CHECK constraint. If a user needs zero risk of semantic
changes for regexes, the only option is "C". Perhaps there should be a
separate test target for this purpose, so that the test doesn't exit
early?
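
To illustrate the difference between the two modes, here is a minimal
sketch, not the actual category_test code. It compares two category
lookups over a set of code points and either stops at the first
mismatch (pass-or-fail, for #1 and #2) or collects every difference
(report mode, for #3). The function name and the stop_at_first
parameter are hypothetical.

```python
def compare_categories(pg_category, ref_category, code_points,
                       stop_at_first=True):
    """Return a list of (code_point, pg_value, ref_value) mismatches.

    pg_category and ref_category are callables mapping a code point to
    a general category string such as 'Lu' or 'Nd'. With stop_at_first
    the scan ends at the first difference; otherwise it reports all.
    """
    mismatches = []
    for cp in code_points:
        ours, theirs = pg_category(cp), ref_category(cp)
        if ours != theirs:
            mismatches.append((cp, ours, theirs))
            if stop_at_first:
                break
    return mismatches
```

For a quick experiment outside the tree, Python's unicodedata module
can stand in for the reference side, for example
`lambda cp: unicodedata.category(chr(cp))`; its Unicode version is
given by `unicodedata.unidata_version`.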

(Note: case mapping has much stronger guarantees than the character
classification.)

I will update the README to document how someone running update-unicode
should interpret the test results.

Regards,
    Jeff Davis



