Re: fixing tsearch locale support
From | Daniel Verite |
---|---|
Subject | Re: fixing tsearch locale support |
Date | |
Msg-id | 15e97660-9e3c-43a2-8cad-7b33fc7f7476@manitou-mail.org |
In response to | Re: fixing tsearch locale support (Peter Eisentraut <peter@eisentraut.org>) |
List | pgsql-hackers |
Peter Eisentraut wrote:

> There is a PG18 open item to document this possible upgrade incompatibility.
>
> I think the following text could be added to the release notes:
>
> """
> The locale implementation underlying full-text search was improved. It
> now observes the locale provider configured for the database. It was
> previously hardcoded to use the configured libc LC_CTYPE setting
> [...]

That sounds misleading because LC_CTYPE is still used in 18.

To illustrate: in an ICU database, the parser will classify "Em Dash" as a separator or not depending on LC_CTYPE.

with LC_CTYPE=C:

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
 alias |   token   |   lexemes
-------+-----------+-------------
 word  | ABCD—EFGH | {abcd—efgh}

with LC_CTYPE=en_US.utf8 (glibc 2.35):

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | ABCD  | {abcd}
 blank     | —     |
 asciiword | EFGH  | {efgh}

OTOH lower casing uses LC_CTYPE in 17, but not in 18, leading to better lexemes.

pg17, ICU locale, LC_CTYPE=C:

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {ÉtÉ}

pg18, ICU locale, LC_CTYPE=C:

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {été}

So maybe the release notes should say "now observes the locale provider configured for the database to convert strings to lower case".

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
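For anyone wanting to reproduce the comparisons above, the following is a minimal sketch of how one might set up an ICU database with a given LC_CTYPE and confirm its settings. The database name `tsearch_icu_c` and the ICU locale `en-US` are assumptions chosen for illustration, not taken from the message; the `CREATE DATABASE` options shown require PostgreSQL 15 or later built with ICU support.

```sql
-- Hypothetical test database: ICU provider, libc LC_CTYPE set to C.
-- template0 is used because the locale settings differ from the template's.
CREATE DATABASE tsearch_icu_c
    TEMPLATE template0
    ENCODING 'UTF8'
    LOCALE_PROVIDER icu
    ICU_LOCALE 'en-US'
    LC_COLLATE 'C'
    LC_CTYPE 'C';

-- After connecting to that database, show the provider and libc settings.
-- datlocprovider: 'c' = libc, 'i' = icu ('b' = builtin, PostgreSQL 17+).
SELECT datname, datlocprovider, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Then re-run the queries from the message, for example:
SELECT alias, token, lexemes
FROM ts_debug('simple', U&'ABCD\2014EFGH');
```

Creating a second database with `LC_CTYPE 'en_US.utf8'` (if that locale exists on the system) should allow the two parser results to be compared side by side.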