Re: pg full text search very slow for Chinese characters

From: Kyotaro Horiguchi
Subject: Re: pg full text search very slow for Chinese characters
Msg-id: 20190911.113417.69552735.horikyota.ntt@gmail.com
In reply to: Re: pg full text search very slow for Chinese characters (Andreas Joseph Krogh <andreas@visena.com>)
List: pgsql-general

Hi.

At Tue, 10 Sep 2019 18:42:26 +0200 (CEST), Andreas Joseph Krogh <andreas@visena.com> wrote in
<VisenaEmail.3.8750116fce15432e.16d1c0b2b28@tc7-visena>
> On Tuesday 10 September 2019 at 18:21:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jimmy Huang <jimmy_huang@live.com> writes:
> > > I tried pg_trgm and my own customized token parser
> > > https://github.com/huangjimmy/pg_cjk_parser
> >
> > pg_trgm is going to be fairly useless for indexing text that's mostly
> > multibyte characters, since its unit of indexable data is just 3 bytes
> > (not characters). I don't know of any comparable issue in the core
> > tsvector logic, though. The numbers you're quoting do sound quite awful,
> > but I share Cory's suspicion that it's something about your setup rather
> > than an inherent Postgres issue.
> >
> > regards, tom lane
>
> We experienced quite awful performance when we hosted the
> DB on virtual servers (~5 years ago) and it turned out we hit the write-cache
> limit (then 8GB), which resulted in ~1MB/s IO thruput. Running iozone might
> help tracing down IO-problems.
>
> --
> Andreas Joseph Krogh

Multibyte characters also quickly bloat the index with a great many
small buckets, one for every 3-character combination drawn from
thousands of characters, which makes it useless.

pg_bigm, which is based on bigrams (2-grams), works better on
multibyte characters.

https://pgbigm.osdn.jp/index_en.html
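
To put rough numbers on the bucket argument, here is a small sketch in
Python. It is only an illustration of the arithmetic, not how pg_trgm or
pg_bigm are implemented: the helper names (ngrams, keyspace), the sample
string, and the alphabet size of 3000 characters are all assumptions
chosen for the example.

```python
# Illustrative sketch only; not taken from pg_trgm or pg_bigm.
# ALPHABET_SIZE and SAMPLE are assumed values for the example.

def ngrams(text, n):
    """Return every run of n consecutive characters (not bytes) in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def keyspace(alphabet_size, n):
    """Upper bound on the number of distinct n-grams over an alphabet."""
    return alphabet_size ** n

ALPHABET_SIZE = 3000       # assumed number of characters in common use
SAMPLE = "全文检索很慢"      # hypothetical 6-character sample text

bigrams = ngrams(SAMPLE, 2)    # 5 two-character keys
trigrams = ngrams(SAMPLE, 3)   # 4 three-character keys

print(bigrams)
print(trigrams)
print("possible trigrams:", keyspace(ALPHABET_SIZE, 3))
print("possible bigrams: ", keyspace(ALPHABET_SIZE, 2))
```

Under these assumptions the trigram key space is 3000 times larger than
the bigram one, so the same text is spread over far more and far smaller
buckets.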

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


