Thread: `pg_trgm` not recognizing Chinese characters in macOS


`pg_trgm` not recognizing Chinese characters in macOS

From
Haotian Yang
Date:
Versions: macOS 10.13.6, PostgreSQL 10.5, pg_trgm 1.3.
LC_ALL=en_US.UTF-8

reproduce:
- enter psql as admin.
- `CREATE EXTENSION pg_trgm`.
- `SELECT show_trgm('一个句子')`.

expects:
- something like `{0x…,0x…}`

gets:
- `{}`
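
The `LC_ALL` value in the report describes the shell environment. The server classifies characters using the `lc_ctype` the database was created with, which may or may not match it. A minimal sketch for confirming the actual settings before running the reproduction, assuming a psql session in the affected database (the queries use only standard catalogs; nothing here comes from the original report):

```sql
-- Encoding and locale of the current database (fixed when it was created).
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();

-- Installed pg_trgm version.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_trgm';

-- The reproduction itself; an empty array means no character
-- in the string was treated as a word character.
SELECT show_trgm('一个句子');
```

If `datctype` turns out to be `C` rather than `en_US.UTF-8`, that alone would likely explain the empty result, independent of macOS.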


Re: `pg_trgm` not recognizing Chinese characters in macOS

From
Tom Lane
Date:
Haotian Yang <yangnw@live.com> writes:
> Versions: macOS 10.13.6, PostgreSQL 10.5, pg_trgm 1.3.
> LC_ALL=en_US.UTF-8

pg_trgm relies on libc's functions (specifically, iswalpha()) to determine
what is a word character or not.  Unfortunately, the UTF8 locale support
in macOS is pretty incomplete, and I don't find it too surprising that
it's not recognizing Chinese characters as alphabetic.  Now, you could
make a good argument that they *shouldn't* be considered alphabetic in
an en_US locale; but I'm unsure whether switching to a more appropriate
locale will help.

Anyway, I'd first try zh_CN.UTF-8, and if that doesn't fix it, the place
to complain is https://bugreport.apple.com/ ... I'm sure they know about
it already, but the number of reports has an impact on how fast they
fix things.

            regards, tom lane
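
The suggestion to try `zh_CN.UTF-8` can be sketched as follows. This is only an illustration under assumptions: `trgm_zh` is a hypothetical database name, the `zh_CN.UTF-8` locale must be installed on the operating system, and, as noted above, it is not certain that this changes the result on macOS. A new database is needed because `lc_ctype` cannot be changed on an existing one, and `template0` is required when the locale differs from the template's.

```sql
-- trgm_zh is a hypothetical name used only for this test.
CREATE DATABASE trgm_zh
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'zh_CN.UTF-8'
    LC_CTYPE 'zh_CN.UTF-8';

-- psql meta-command: switch to the new database.
\c trgm_zh

CREATE EXTENSION pg_trgm;

-- Re-run the reproduction under the Chinese locale.
SELECT show_trgm('一个句子');
```

If the result is still `{}`, the limitation is in the platform's `iswalpha()` rather than in the choice of locale.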


Re: `pg_trgm` not recognizing Chinese characters in macOS

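From
"周正中(德歌)"
Date:
You should set lc_ctype to something other than C. With lc_ctype = en_US.UTF8 the Chinese characters produce trigrams; with lc_ctype = 'C' they are dropped: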

```
postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   
-----------+----------+----------+------------+------------+-----------------------
 newdb     | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | 
 postgres  | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | 
 template0 | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(4 rows)


postgres=# select show_trgm('hello你好');
                      show_trgm                       
------------------------------------------------------
 {0xcf7970,0xfe5170,0x114ebf,"  h"," he",ell,hel,llo}
(1 row)

postgres=# create database testdb with template template0 lc_ctype='C';
CREATE DATABASE
postgres=# \c testdb
You are now connected to database "testdb" as user "postgres".
testdb=# create extension pg_trgm;
CREATE EXTENSION
testdb=# select show_trgm('hello你好');
            show_trgm            
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}
(1 row)
```
------------------------------------------------------------------
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Tuesday, September 11, 2018, 21:20
To: Haotian Yang <yangnw@live.com>
Cc: pgsql-bugs@postgresql.org <pgsql-bugs@postgresql.org>
Subject: Re: `pg_trgm` not recognizing Chinese characters in macOS
