Why can I not get lexemes for Hebrew but can get them for Armenian?

From	Sam Saffron
Subject	Why can I not get lexemes for Hebrew but can get them for Armenian?
Date	27 February 2019, 10:11:37
Msg-id	CAAtdryM4vrD+XEOho7me4pH7qHN=DpjF6QFe1BJXFgAQkHE3nA@mail.gmail.com
Replies	Re: Why can I not get lexemes for Hebrew but can get them for Armenian?
List	pgsql-general

(This is a cross-post from Stack Exchange, where it is not getting much traction.)

On my Mac install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
 to_tsvector
-------------
 'abcd':1
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols   |  สวัสดี | {}             |              |
(2 rows)
```

On my Linux install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
    to_tsvector
-------------------
 'abcd':1 'สวัสดี':2
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |    description    | token |  dictionaries  |  dictionary  | lexemes
-----------+-------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII   | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols     |       | {}             |              |
 word      | Word, all letters | สวัสดี  | {english_stem} | english_stem | {สวัสดี}
(3 rows)

```

So something is clearly different about the way tokenisation is
defined in PG on the two machines. My question is: how do I figure out
what is different, and how do I make my Mac install of PG work like the
Linux one?

On both installs:

```
# SHOW default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.english
(1 row)

# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)
```

So somehow this Mac install thinks that Thai letters are spaces. How
do I debug this and fix the "Space symbols" definition here?

Interestingly, this install works with Armenian but falls over when we
reach Hebrew:

```
=# select * from ts_debug('ԵԵԵ');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | ԵԵԵ   | {english_stem} | english_stem | {եեե}
(1 row)

=# select * from ts_debug('אאא');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | אאא   | {}           |            |
(1 row)
```
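
One possible way to narrow this down, offered only as a sketch: run the same built-in diagnostics on both machines and compare the results. `ts_parse` shows what the default parser does before any dictionary is consulted, so if its output already differs between the Mac and the Linux box, the text search configuration and `english_stem` are not the cause. A plausible but unverified explanation in that case would be that the parser's character classification depends on the operating system's locale data, which can differ between platforms even when both report `en_US.UTF-8`. The sample string is simply the test words from this post.

```sql
-- Token types known to the default parser (tokid, alias, description).
SELECT * FROM ts_token_type('default');

-- Raw parser output with no dictionaries involved:
-- compare the tokid assigned to each piece on both machines.
SELECT * FROM ts_parse('default', 'hello สวัสดี ԵԵԵ אאא');

-- Encoding and locale actually in effect for the current database.
SHOW server_encoding;
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Exact server version and build platform.
SELECT version();
```

Joining the `tokid` values from `ts_parse` against `ts_token_type` by eye shows which tokens the parser labels `blank` and which it labels `word` on each install.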
