Thread: Why can I not get lexemes for Hebrew but can get them for Armenian?

Why can I not get lexemes for Hebrew but can get them for Armenian?

From:
Sam Saffron
Date:
(This is a cross-post from Stack Exchange; it's not getting much traction there.)

On my Mac install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
 to_tsvector
-------------
 'abcd':1
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols   |  สวัสดี | {}             |              |
(2 rows)
```

On my Linux install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
    to_tsvector
-------------------
 'abcd':1 'สวัสดี':2
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |    description    | token |  dictionaries  |  dictionary  | lexemes
-----------+-------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII   | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols     |       | {}             |              |
 word      | Word, all letters | สวัสดี  | {english_stem} | english_stem | {สวัสดี}
(3 rows)

```

So something is clearly different about the way tokenisation is
defined in PG. My question is: how do I figure out what is different,
and how do I make my Mac install of PG work like the Linux one?
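
One way to narrow this down (an editorial suggestion, not from the
original thread): ts_parse() runs only the parser, with no dictionaries
involved, so comparing its output on the two machines shows whether the
divergence really is in the tokenisation layer rather than in the text
search configuration.

```
-- ts_parse() emits the raw (tokid, token) pairs from the parser,
-- before any dictionary processing. If the two machines disagree
-- here, the problem is the parser's character classification
-- (i.e. libc), not the text search configuration.
SELECT * FROM ts_parse('default', 'hello สวัสดี');
```

On the Linux box the Thai sample should come back as a word token; on
the Mac it should come back as blank, matching the ts_debug output
above.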

On both installs:

```
# SHOW default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.english
(1 row)

# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)
```

So somehow this Mac install thinks that Thai letters are spaces... how
do I debug this and fix the "Space symbols" definition here?
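
For what it's worth (an editorial note, not from the thread): "Space
symbols" is not a definition you can edit. It is one of a fixed set of
token types built into the default parser, and the classification
itself happens in the parser's C code using the libc character-class
functions, which is why it varies between platforms. You can list the
types, but not redefine them:

```
-- The default parser ships a fixed list of token types; 'blank'
-- ("Space symbols") is where it puts anything it cannot classify
-- as a letter. There is no SQL-level knob to change this mapping.
SELECT * FROM ts_token_type('default');
```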

Interestingly, this install works with Armenian but falls over when we
reach Hebrew:

```
=# select * from ts_debug('ԵԵԵ');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | ԵԵԵ   | {english_stem} | english_stem | {եեե}
(1 row)

=# select * from ts_debug('אאא');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | אאא   | {}           |            |
(1 row)
```
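
A quick way to survey several scripts at once (a sketch, not from the
thread; the sample strings are arbitrary) is to drive ts_debug() from a
VALUES list; any sample that only comes back as blank is one this
build's locale support failed to classify as letters:

```
-- Hypothetical survey: which scripts does this build treat as words?
SELECT s.script, d.alias, d.token
FROM (VALUES ('Armenian', 'ԵԵԵ'),
             ('Hebrew',   'אאא'),
             ('Thai',     'สวัสดี')) AS s(script, sample),
     LATERAL ts_debug(s.sample) AS d;
```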


Re: Why can I not get lexemes for Hebrew but can get them for Armenian?

From:
Tom Lane
Date:
Sam Saffron <sam.saffron@gmail.com> writes:
> So something is clearly different about the way tokenisation is
> defined in PG. My question is: how do I figure out what is different,
> and how do I make my Mac install of PG work like the Linux one?

I'm not sure you can :-(.  This devolves to what the libc locale
functions (isalpha(3) and friends) do, and unfortunately the UTF8
locales on OS X are impossibly lame.  They tend not to provide
useful character classifications for high Unicode code points.
They don't sort very well either, though that's not your problem here.

Depending on what characters you actually need to work with,
you might have better luck using one of the ISO8859 character set
locales.  Though if you actually need both Hebrew and Armenian
in the same DB, that suggestion is a nonstarter.

            regards, tom lane
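
A minimal sketch of that suggestion (an editorial addition, not from
the thread; the locale name is an assumption, check what locale -a
reports on the machine). Since lc_ctype is fixed per database at
creation time, this means building a fresh database from template0:

```
-- Assumes the OS provides a he_IL.ISO8859-8 locale; names vary by
-- platform. ISO-8859-8 covers Hebrew but not Armenian or Thai, hence
-- the caveat above about needing both in the same DB.
CREATE DATABASE hebrew_fts
  WITH TEMPLATE   = template0
       ENCODING   = 'ISO_8859_8'
       LC_COLLATE = 'he_IL.ISO8859-8'
       LC_CTYPE   = 'he_IL.ISO8859-8';
```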