experimental: TSearch dictionary [de]serialization

From	Pavel Stehule
Subject	experimental: TSearch dictionary [de]serialization
Date	31 August 2010, 19:20:11
Msg-id	AANLkTinnim1joUog5bWsFW06uC4vVESZg6XoH40sbTSw@mail.gmail.com
List	pgsql-hackers

Hello,

I wrote some very primitive code for testing serialization and
deserialization of a TSearch ISpell dictionary. The code works, but for
now it is useful only for speed tests.

The Czech fulltext dictionary serializes to a file of about 9 MB. Saving
takes about 90 ms, and reading takes about the same time.
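
To illustrate the general idea (this is not the code from the attached patch), here is a minimal Python sketch of flattening a pointer-based prefix tree into one contiguous byte string, with child links stored as byte offsets instead of pointers. The record layout and the names `build_trie`, `serialize`, and `contains` are assumptions made up for this example; a real ISpell dictionary also carries affix data that is not modelled here.

```python
import struct

# Hypothetical on-disk layout, invented for this sketch:
FILE_HDR = struct.Struct("<I")   # byte offset of the root node
NODE_HDR = struct.Struct("<BH")  # is_word flag, number of children
EDGE = struct.Struct("<II")      # code point of the letter, offset of the child


def build_trie(words):
    """Build an ordinary in-memory prefix tree from nested dicts."""
    root = {"end": False, "kids": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["kids"].setdefault(ch, {"end": False, "kids": {}})
        node["end"] = True
    return root


def serialize(root):
    """Flatten the tree into bytes; links become offsets into the buffer."""
    buf = bytearray(FILE_HDR.size)  # reserve room for the root offset

    def emit(node):
        # Children are written first, so their offsets are already known.
        edges = [(ord(ch), emit(kid)) for ch, kid in sorted(node["kids"].items())]
        offset = len(buf)
        buf.extend(NODE_HDR.pack(int(node["end"]), len(edges)))
        for code_point, child_offset in edges:
            buf.extend(EDGE.pack(code_point, child_offset))
        return offset

    FILE_HDR.pack_into(buf, 0, emit(root))
    return bytes(buf)


def contains(blob, word):
    """Look a word up directly in the flat bytes, without rebuilding the tree."""
    (offset,) = FILE_HDR.unpack_from(blob, 0)
    for ch in word:
        _, count = NODE_HDR.unpack_from(blob, offset)
        base = offset + NODE_HDR.size
        for i in range(count):
            code_point, child = EDGE.unpack_from(blob, base + i * EDGE.size)
            if code_point == ord(ch):
                offset = child
                break
        else:
            return False  # no edge for this letter
    is_word, _ = NODE_HDR.unpack_from(blob, offset)
    return bool(is_word)


blob = serialize(build_trie(["voda", "vody", "kůň"]))
```

Because the result is a single byte string with no internal pointers, it can be saved with one write and loaded with one read, which is the property that makes loading a preprocessed file cheaper than rebuilding the structure from the dictionary sources.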

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 92.708 ms -- using a preprocessed dictionary

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 3.758 ms -- standard time (dictionary is loaded)

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 518.528 ms -- typical first evaluation time

So using a preprocessed file helps: in this test the first evaluation is
more than 5x faster (about 93 ms instead of about 519 ms). But that is
still about 25x slower than using an already loaded dictionary (about
3.8 ms). I found one issue: I am not able to serialize a full regexp.
The Czech dictionary doesn't use one, so I didn't solve this task.

I would like to add a few hooks to ISpellDictionary so that it is
possible to implement custom memory management for ispell dictionaries.
I understand the problems with shared memory or mmap, but I don't see
any other way than to use third-party mmap support. This module does not
need to be in core; this is probably only a local Czech (and maybe
Japanese) problem.
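
As a rough illustration of why mmap is attractive here, the following Python sketch maps a preprocessed file read-only and searches it in place, so nothing is parsed or copied at load time and the operating system can share the pages between processes. It is a simplified stand-in, not the ISpell structure or the attached patch: the fixed-width record format, the 32-byte field size, and the names `write_table`, `open_table`, and `lookup` are all assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile

FIELD_LEN = 32  # hypothetical fixed field width, NUL-padded UTF-8
RECORD = struct.Struct(f"<{FIELD_LEN}s{FIELD_LEN}s")  # word form, lexeme


def write_table(path, pairs):
    """Write (form, lexeme) records sorted by the UTF-8 bytes of the form."""
    rows = sorted((f.encode("utf-8"), l.encode("utf-8")) for f, l in pairs.items())
    with open(path, "wb") as out:
        for form, lexeme in rows:
            out.write(RECORD.pack(form, lexeme))


def open_table(path):
    """Map the file read-only; the file must not be empty."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(table, word):
    """Binary search over the mapped records; returns the lexeme or None."""
    key = word.encode("utf-8")
    lo, hi = 0, len(table) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        form, lexeme = RECORD.unpack_from(table, mid * RECORD.size)
        form = form.rstrip(b"\0")
        if form == key:
            return lexeme.rstrip(b"\0").decode("utf-8")
        if form < key:
            lo = mid + 1
        else:
            hi = mid
    return None


# Usage with the word forms and lexemes from the ts_debug output above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cs_forms.tbl")  # hypothetical file name
    write_table(path, {"napil": "napít", "žluté": "žlutý", "vody": "voda"})
    table = open_table(path)
    found = lookup(table, "vody")
    missing = lookup(table, "kočka")
    table.close()
```

With this layout the cost of "loading" is only the cost of mapping the file; pages are faulted in as lookups touch them, which is the behaviour that custom memory management hooks would need in order to avoid the per-backend load time.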

Regards

Pavel Stehule

Attachments

ft02.diff
