Re: Issue Supporting Emojis in Full Text Search on Ubuntu
От | Tom Lane |
---|---|
Тема | Re: Issue Supporting Emojis in Full Text Search on Ubuntu |
Дата | |
Msg-id | 547627.1593563243@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Issue Supporting Emojis in Full Text Search on Ubuntu (Jordan Hurwich <jhurwich@pulsasensors.com>) |
Ответы |
Re: Issue Supporting Emojis in Full Text Search on Ubuntu
Re: Issue Supporting Emojis in Full Text Search on Ubuntu |
Список | pgsql-novice |
Jordan Hurwich <jhurwich@pulsasensors.com> writes: > We rely on the Postgres tsvector implementation to enable full text search > in our app, but we're having some issues with getting the parser to > recognize emoji characters (like "ð" <U+1F600>) as anything other than > "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters > are recognized as "word"/"Word, all letters" characters on Mac; and > non-english, non-emoji characters (like "我" <U+6211>) are recognized as > "word" characters on both Mac and Ubuntu. With a non-C lc_ctype setting, which characters are considered to be letters is entirely up to the locale code supplied by the platform's libc. I duplicate your result that U+1F600 is not considered a letter by glibc, at least not with lc_ctype = en_US.utf8. (Perhaps there are other locale settings that would consider it a letter? Not my expertise though.) However, with lc_ctype set to C, you should get a platform-independent result that any non-ASCII character is a letter. macOS' utf8-based locales are almost completely broken, unfortunately, so it's hard to predict what they will do except that it'll probably be wrong. > We have not modified the 'english' text search configuration on either > instance, however the query "SELECT * FROM ts_debug('english', 'ð');" > returns different results on MacOS 10.15.5 and our Ubuntu instance: That's unsurprising in itself, per the above. There's no standardization worth mentioning in this area. > There are minor differences between MacOS and Ubuntu in pg_database as > follows, however modifications to set datcollate and datctype to 'C' on > Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for > ts_debug(). lc_ctype = C should have done what you want (at least for this specific symbol), so I wonder whether you did the test right. You gave few details about how you tested this. > db=# select datname, encoding, datcollate, datctype, datistemplate from > pg_database; > datname | encoding | datcollate | datctype | datistemplate > ----------------+----------+------------+----------+--------------- > postgres | 6 | C | C | f > template0 | 6 | C | C | t > template1 | 6 | C | C | t > testdb01 | 6 | C | C | f Hm, how are you getting that result when your prompt says you're connected to database "db"? > The result of `$ locale` on both instances is similar, included below for > Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items > are returned while only 4 entries are returned on Ubuntu, included below: > - on Ubuntu > $ locale -a > C > C.UTF-8 > en_US.utf8 > POSIX This just indicates that you didn't install the package(s) that provide alternative locales. On my RHEL box, "locale -a" reports 865 entries, and I'm pretty sure I don't have all the odder ones. I do not know, unfortunately, how Ubuntu/Debian divvy this stuff up into packages. Anyway, the bottom line is that if you need platform-independent results then setting lc_ctype to "C" is what to do. It will not be very bright about non-ASCII characters, but at least the stupidity will be uniform. regards, tom lane
В списке pgsql-novice по дате отправления:
Предыдущее
От: Jordan HurwichДата:
Сообщение: Issue Supporting Emojis in Full Text Search on Ubuntu
Следующее
От: Chamath SajeewaДата:
Сообщение: PostgreSQL Module to use Oracle specific functions as it is.