Re: Built-in CTYPE provider

From Jeremy Schneider
Subject Re: Built-in CTYPE provider
Date
Msg-id 8a1ae216-8150-41e2-a98d-09c57e3dc90f@ardentperf.com
In response to Built-in CTYPE provider (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Built-in CTYPE provider (Jeremy Schneider <schneider@ardentperf.com>)
Re: Built-in CTYPE provider (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On 12/5/23 3:46 PM, Jeff Davis wrote:
> CTYPE, which handles character classification and upper/lowercasing
> behavior, may be simpler than it first appears. We may be able to get
> a net decrease in complexity by just building in most (or perhaps all)
> of the functionality.
> 
> === Character Classification ===
> 
> Character classification is used for regexes, e.g. whether a character
> is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
> class. Unicode defines what character properties map into these
> classes in TR #18 [1], specifying both a "Standard" variant and a
> "POSIX Compatible" variant. The main difference with the POSIX variant
> is that symbols count as punctuation.
> 
> === LOWER()/INITCAP()/UPPER() ===
> 
> The LOWER() and UPPER() functions are defined in the SQL spec with
> surprising detail, relying on specific Unicode General Category
> assignments. How to map characters seems to be left (implicitly) up to
> Unicode. If the input string is normalized, the output string must be
> normalized, too. Weirdly, there's no room in the SQL spec to localize
> LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
> specifies one example, which is that "ß" becomes "SS" when folded to
> upper case. INITCAP() is not in the SQL spec.

I'll be honest, even though this is primarily about CTYPE and not
collation, I still need to keep re-reading the initial email slowly to
let it sink in and better understand it... at least for me, it's complex
to reason through. 🙂

I'm trying to make sure I understand clearly what the user impact/change
is that we're talking about: after a little bit of brainstorming and
looking through the PG docs, I'm actually not seeing much more than
these two things you've mentioned here: the set of regexp_* functions PG
provides, and these three generic functions. That alone doesn't seem
highly concerning.
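
As a rough illustration of those two user-visible behaviors, here is a
minimal Python sketch. It is not PostgreSQL code: the helper names
is_punct_standard and is_punct_posix are hypothetical, and the POSIX rule
is simplified (TR #18 also excludes alphabetic characters from the symbol
set, which is ignored here). It only shows how "symbols count as
punctuation" changes a "[[:punct:]]" test, and the "ß" to "SS" upper-case
mapping from the quoted text.

```python
import unicodedata


def is_punct_standard(ch: str) -> bool:
    # Assumed "Standard" rule: only General Category P* is punctuation.
    return unicodedata.category(ch).startswith("P")


def is_punct_posix(ch: str) -> bool:
    # Assumed "POSIX Compatible" rule: symbols (S*) count as punctuation
    # too. Simplification: the alphabetic exclusion is not applied.
    return unicodedata.category(ch)[0] in ("P", "S")


if __name__ == "__main__":
    for ch in "!+$a":
        print(repr(ch), unicodedata.category(ch),
              is_punct_standard(ch), is_punct_posix(ch))
    # Python's str.upper() applies Unicode full case mapping.
    print("ß".upper())
```

With these definitions, "+" and "$" are punctuation only under the POSIX
variant, while "!" is punctuation under both.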

I haven't checked the source code for the regexp_* functions yet, but
are these just passing through to an external library? Are we actually
able to easily change the CTYPE provider for them? If nobody
knows/replies then I'll find some time to look.

One other thing that comes to mind: how does the parser do case folding
for relation names? Is that using OS-provided libc as of today? Or did
we code it to use ICU if that's the DB default? I'm guessing libc, and
global catalogs probably need to be handled in a consistent manner, even
across different encodings.

(Kind of related... did you ever see the demo where I create a user named
'🏃' and then I try to connect to a database with non-Unicode encoding?
💥😜  ...at least it seems to be able to walk the index without decoding
strings to find other users - but the way these global catalogs work
scares me a little bit)

-Jeremy


-- 
http://about.me/jeremy_schneider



