Re: Built-in CTYPE provider

From: Jeff Davis
Subject: Re: Built-in CTYPE provider
Msg-id: d4616159b36de9451edd35fa7b2f36f299005c9c.camel@j-davis.com
In response to: Re: Built-in CTYPE provider  (Jeremy Schneider <schneider@ardentperf.com>)
Responses: Re: Built-in CTYPE provider
List: pgsql-hackers

On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
> Looking closer, patches 3 and 4 look like an incremental extension of
> this earlier idea;

Yes, it's essentially the same thing extended to a few more files. I
don't know if "incremental" is the right word though; this is a
substantial extension of the idea.

> the perl scripts download data from unicode.org and we've
> specifically defined Unicode version 15.1 and the scripts turn the
> data tables inside-out into C data structures optimized for lookup.
> That C code is then checked in to the PostgreSQL source code files
> unicode_category.h and unicode_case_table.h - right?

Yes. The standard build process shouldn't be downloading files, so the
static tables are checked in. Also, seeing the diffs of the static
tables improves the visibility of changes in case there's some mistake
or big surprise.
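
To illustrate what "C data structures optimized for lookup" can mean here, below is a minimal sketch of one common approach: a sorted table of code point ranges searched with binary search. This is not the layout the patches actually generate in unicode_category.h. The names demo_range, demo_uppercase, demo_range_search and demo_is_upper are hypothetical, and the table is a tiny hand-written excerpt rather than generated Unicode 15.1 data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range entry; NOT PostgreSQL's actual table layout. */
typedef struct
{
	uint32_t	first;		/* first code point in the range, inclusive */
	uint32_t	last;		/* last code point in the range, inclusive */
} demo_range;

/*
 * Tiny hand-written excerpt of uppercase-letter ranges, for illustration
 * only. A generator script would emit the full table. Entries must be
 * sorted by code point and must not overlap.
 */
static const demo_range demo_uppercase[] = {
	{0x0041, 0x005A},		/* LATIN CAPITAL LETTER A..Z */
	{0x00C0, 0x00D6},		/* Latin-1 capitals before U+00D7 */
	{0x00D8, 0x00DE},		/* Latin-1 capitals after U+00D7 */
	{0x0391, 0x03A1},		/* GREEK CAPITAL LETTER ALPHA..RHO */
	{0x03A3, 0x03AB},		/* GREEK CAPITAL LETTER SIGMA onward */
};

/* Binary search: true if cp falls inside any range of the sorted table. */
static bool
demo_range_search(const demo_range *tbl, size_t n, uint32_t cp)
{
	size_t		lo = 0;
	size_t		hi = n;		/* search window is [lo, hi) */

	assert(tbl != NULL || n == 0);

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (cp < tbl[mid].first)
			hi = mid;
		else if (cp > tbl[mid].last)
			lo = mid + 1;
		else
			return true;
	}
	return false;
}

/* Hypothetical property test built on the excerpt table above. */
static bool
demo_is_upper(uint32_t cp)
{
	return demo_range_search(demo_uppercase,
							 sizeof(demo_uppercase) / sizeof(demo_uppercase[0]),
							 cp);
}
```

With a table like this checked in, regenerating it for a new Unicode version shows up in the diff as added, removed, or shifted ranges, which is the kind of visibility mentioned above.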

> Am I reading correctly that these two patches add C functions
> pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
> don't yet reference these functions anywhere? So this is just getting
> some plumbing in place?

Correct. Perhaps I should combine these into the builtin provider
thread, but these are independently testable and reviewable.

> My prediction is that updating this built-in provider eventually
> won't be any different from ICU or glibc.

The built-in provider will have several advantages because it's tied to
a PG major version:

  * A physical replica can't have different semantics than the primary.
  * Easier to document and test.
  * Changes are more transparent and can be documented in the release
    notes, so that administrators can understand the risks and blast
    radius at pg_upgrade time.

> Later on down the road, from a user perspective, I think we should be
> careful about confusion where providers are used inconsistently. It's
> not great if one function follows built-in Unicode 15.1 rules but
> another function uses Unicode 13 rules because it happened to call an
> ICU function or a glibc function. We could easily end up with multiple
> providers processing different parts of a single SQL statement, which
> could lead to strange results in some cases.

The whole concept of "providers" is that they aren't consistent with
each other. ICU, libc, and the builtin provider will all be based on
different versions of Unicode. That's by design.

The built-in provider will be a bit better in the sense that it's
consistent with the normalization functions, and the other providers
aren't.

Regards,
    Jeff Davis





