Discussion: ts_parse reports different results between MacOS and FreeBSD/Linux
Hello,
We have an application whose test suite fails on MacOS when running the search tests on unicode characters.
I've narrowed it down to the following:
macos=# select * from ts_parse('default','天');
 tokid | token
-------+-------
    12 | 天
(1 row)
freebsd=# select * from ts_parse('default','天');
 tokid | token
-------+-------
     2 | 天
(1 row)
This has been bugging me for a while, but it's a test our devs using MacOS just ignore for now, as we know it passes
our CI/CD pipeline on FreeBSD/Linux. It seems that anyone shipping an app on MacOS and bundling Postgres is going
to have a bad time with searching.
Please let me know if there's anything I can do to help. Will gladly test patches.
Thanks,
--
Mark Felder
ports-secteam & portmgr alumni
feld@FreeBSD.org
"Mark Felder" <feld@FreeBSD.org> writes:
> We have an application whose test suite fails on MacOS when running the search tests on unicode characters.
Yeah, known problem :-(. The text search parser relies on the C library's
locale data to classify characters as being letters, digits, etc.
Unfortunately, the UTF8 locales on macOS are just horribly bad, and
report many results that are different from other platforms.
I suppose that Apple has got reasonable Unicode character knowledge
somewhere in their OS; they are just not very interested in making the
POSIX locale APIs work well. Which leaves us with a bit of a problem
for getting consistent results cross-platform.
regards, tom lane