Re: Issue Supporting Emojis in Full Text Search on Ubuntu

Поиск
Список
Период
Сортировка
От Jordan Hurwich
Тема Re: Issue Supporting Emojis in Full Text Search on Ubuntu
Дата
Msg-id CAJKqsjvUZUwyYT6438C8VjnWzDJ9EL7qM_-pV0nwdQpERgW1MA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Issue Supporting Emojis in Full Text Search on Ubuntu  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Issue Supporting Emojis in Full Text Search on Ubuntu
Список pgsql-novice
Thank you very much Tom! 

You were completely right to question our previous manipulation of the pg_database entry, and after revisiting that we were able to address the issue by updating datcollate and datctype to "C" - ts_debug() now recognizes emoji characters as "word" characters and text search functionality is behaving as expected.

Thank you once again for taking the time to understand and figure out our issue,
Jordan

On Tue, Jun 30, 2020 at 5:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jordan Hurwich <jhurwich@pulsasensors.com> writes:
> We rely on the Postgres tsvector implementation to enable full text search
> in our app, but we're having some issues with getting the parser to
> recognize emoji characters (like "😀" <U+1F600>) as anything other than
> "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters
> are recognized as "word"/"Word, all letters" characters on Mac; and
> non-english, non-emoji characters (like "我" <U+6211>) are recognized as
> "word" characters on both Mac and Ubuntu.

With a non-C lc_ctype setting, which characters are considered to be
letters is entirely up to the locale code supplied by the platform's libc.
I duplicate your result that U+1F600 is not considered a letter by glibc,
at least not with lc_ctype = en_US.utf8.  (Perhaps there are other locale
settings that would consider it a letter?  Not my expertise though.)

However, with lc_ctype set to C, you should get a platform-independent
result that any non-ASCII character is a letter.

macOS' utf8-based locales are almost completely broken, unfortunately,
so it's hard to predict what they will do except that it'll probably be
wrong.

> We have not modified the 'english' text search configuration on either
> instance, however the query "SELECT * FROM ts_debug('english', '😀');"
> returns different results on MacOS 10.15.5 and our Ubuntu instance:

That's unsurprising in itself, per the above.  There's no standardization
worth mentioning in this area.

> There are minor differences between MacOS and Ubuntu in pg_database as
> follows, however modifications to set datcollate and datctype to 'C' on
> Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for
> ts_debug().

lc_ctype = C should have done what you want (at least for this specific
symbol), so I wonder whether you did the test right.  You gave few details
about how you tested this.

> db=# select datname, encoding, datcollate, datctype, datistemplate from
> pg_database;
>     datname     | encoding | datcollate | datctype | datistemplate
> ----------------+----------+------------+----------+---------------
>  postgres       |        6 | C          | C        | f
>  template0      |        6 | C          | C        | t
>  template1      |        6 | C          | C        | t
>  testdb01       |        6 | C          | C        | f

Hm, how are you getting that result when your prompt says you're
connected to database "db"?

> The result of `$ locale` on both instances is similar, included below for
> Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items
> are returned while only 4 entries are returned on Ubuntu, included below:
> - on Ubuntu
> $ locale -a
> C
> C.UTF-8
> en_US.utf8
> POSIX

This just indicates that you didn't install the package(s) that provide
alternative locales.  On my RHEL box, "locale -a" reports 865 entries,
and I'm pretty sure I don't have all the odder ones.  I do not know,
unfortunately, how Ubuntu/Debian divvy this stuff up into packages.

Anyway, the bottom line is that if you need platform-independent results
then setting lc_ctype to "C" is what to do.  It will not be very bright
about non-ASCII characters, but at least the stupidity will be uniform.

                        regards, tom lane

В списке pgsql-novice по дате отправления:

Предыдущее
От: Laurenz Albe
Дата:
Сообщение: Re: PostgreSQL Module to use Oracle specific functions as it is.
Следующее
От: Bzzzz
Дата:
Сообщение: Re: Issue Supporting Emojis in Full Text Search on Ubuntu