Thread: WIP: shared ispell dictionary

WIP: shared ispell dictionary

From
Pavel Stehule
Date:
Hello,

The attached patch makes it possible to share an ispell dictionary between
processes. The motivation is the slowness of the first tsearch query
and the amount of memory allocated per process. When I tested loading the
ispell dictionary for the Czech language, it took about 500 ms and 48 MB.
With a simple allocator it uses only 25 MB. If we remove some checks and the
tolower string transformation from the loading stage, it needs only 200 ms,
but then a broken dict or affix file can produce wrong results. This
patch significantly reduces the load on servers that use ispell
dictionaries.

I know that Tom worries about the use of shared memory. I think the worry is
unnecessary here: after loading, the dictionary data is only read, never
modified. A second idea: this dictionary template could be distributed as
a separate project (it needs a few changes in core, plus the simple
allocator).
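
The "simple allocator" is not shown in the message itself, so the following is only a minimal sketch of what such an allocator could look like, assuming it is a bump (arena) allocator: memory is carved out of large chunks with no per-allocation header, and everything is released at once. The names Arena, arena_alloc, arena_strdup and arena_free_all are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; these names are not from the patch. */
typedef struct ArenaChunk
{
    struct ArenaChunk *next;    /* previously filled chunk, or NULL */
    size_t      used;           /* bytes handed out from data[] */
    size_t      size;           /* capacity of data[] */
    char        data[];         /* allocations are carved from here */
} ArenaChunk;

typedef struct Arena
{
    ArenaChunk *head;           /* chunk currently being filled */
    size_t      chunk_size;     /* default capacity of a new chunk */
} Arena;

static void
arena_init(Arena *a, size_t chunk_size)
{
    assert(chunk_size > 0);
    a->head = NULL;
    a->chunk_size = chunk_size;
}

/* Bump-pointer allocation: no per-allocation header, no individual free. */
static void *
arena_alloc(Arena *a, size_t n)
{
    size_t      align = sizeof(void *);

    /* Round up to pointer size; enough for strings and pointer-based nodes. */
    n = (n + align - 1) & ~(align - 1);

    if (a->head == NULL || a->head->size - a->head->used < n)
    {
        size_t      size = n > a->chunk_size ? n : a->chunk_size;
        ArenaChunk *c = malloc(sizeof(ArenaChunk) + size);

        if (c == NULL)
            return NULL;
        c->next = a->head;
        c->used = 0;
        c->size = size;
        a->head = c;
    }

    void       *p = a->head->data + a->head->used;

    a->head->used += n;
    return p;
}

/* Copy a NUL-terminated string into the arena. */
static char *
arena_strdup(Arena *a, const char *s)
{
    size_t      len = strlen(s) + 1;
    char       *copy = arena_alloc(a, len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Release every chunk at once; all pointers into the arena become invalid. */
static void
arena_free_all(Arena *a)
{
    while (a->head != NULL)
    {
        ArenaChunk *next = a->head->next;

        free(a->head);
        a->head = next;
    }
}
```

A loader would call arena_init once, use arena_strdup for every word and affix it keeps, and call arena_free_all only when the whole dictionary is dropped. A scheme like this can save memory for many small strings because it avoids per-allocation bookkeeping, which fits data that is written once and then only read.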

Usage:

a) set shared_data = 26MB (postgresql.conf)
b) restart
c) register the dictionary with the option "share = yes"

CREATE TEXT SEARCH DICTIONARY cspell
   (template = ispell, dictfile = czech, afffile = czech, stopwords = czech,
    share = yes);


[pavel@nemesis src]$ psql-dev3 postgres
Timing is on.
psql-dev3 (9.0devel)
Type "help" for help.

postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
   alias   |    description    |   token   |  dictionaries   | dictionary |   lexemes
-----------+-------------------+-----------+-----------------+------------+-------------
 word      | Word, all letters | Příliš    | {cspell,simple} | cspell     | {příliš}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluťoučký | {cspell,simple} | cspell     | {žluťoučký}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | kůň       | {cspell,simple} | cspell     | {kůň}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | se        | {cspell,simple} | cspell     | {}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | napil     | {cspell,simple} | cspell     | {napít}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluté     | {cspell,simple} | cspell     | {žlutý}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | vody      | {cspell,simple} | cspell     | {voda}
(13 rows)

Time: 8,178 ms  <<-- without patch 500ms

Limits and ToDo:
a) it supports only simple regular expressions
b) it doesn't handle cache reset and shared memory deallocation

Regards
Pavel Stehule

Attachments

Re: WIP: shared ispell dictionary

From
Heikki Linnakangas
Date:
Pavel Stehule wrote:
> The attached patch makes it possible to share an ispell dictionary between
> processes. The motivation is the slowness of the first tsearch query
> and the amount of memory allocated per process. When I tested loading the
> ispell dictionary for the Czech language, it took about 500 ms and 48 MB.
> With a simple allocator it uses only 25 MB. If we remove some checks and the
> tolower string transformation from the loading stage, it needs only 200 ms,
> but then a broken dict or affix file can produce wrong results. This
> patch significantly reduces the load on servers that use ispell
> dictionaries.
> 
> I know that Tom worries about the use of shared memory. I think the worry is
> unnecessary here: after loading, the dictionary data is only read, never
> modified. A second idea: this dictionary template could be distributed as
> a separate project (it needs a few changes in core, plus the simple
> allocator).

Fixed-size shared memory blocks are always problematic. Would it be
possible to do the preloading with shared_preload_libraries somehow?
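
The preloading idea relies on data loaded before fork() being visible in every child process without reloading. The following is a minimal POSIX sketch of that mechanism, not PostgreSQL code: preload_dictionary and backend_sees_preloaded_data are hypothetical names, and the tiny word list stands in for a real dictionary. It assumes a platform where child processes are created with fork().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a dictionary loaded once in the parent process. */
static char *preloaded_dict = NULL;

/* Load the "dictionary" before any child process is forked. */
static void
preload_dictionary(void)
{
    const char *words = "voda\nkun\nzluty\n";

    preloaded_dict = malloc(strlen(words) + 1);
    assert(preloaded_dict != NULL);
    strcpy(preloaded_dict, words);
}

/*
 * Fork a child (playing the role of a backend) and report whether it can
 * read the data loaded by the parent.  Returns 1 if yes, 0 if no, -1 on
 * error.  The child inherits the parent's pages copy-on-write, so as long
 * as nobody writes to them no second copy is made.
 */
static int
backend_sees_preloaded_data(void)
{
    pid_t       pid = fork();
    int         status;

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* Child: the pointer and its contents are inherited as-is. */
        int         ok = preloaded_dict != NULL &&
                         strstr(preloaded_dict, "kun") != NULL;

        _exit(ok ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling preload_dictionary() once and then backend_sees_preloaded_data() any number of times shows each child reading the same data without loading it again. This only holds where children are forked from the process that did the loading.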

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: WIP: shared ispell dictionary

From
Pavel Stehule
Date:
2010/3/18 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Pavel Stehule wrote:
>> The attached patch makes it possible to share an ispell dictionary between
>> processes. The motivation is the slowness of the first tsearch query
>> and the amount of memory allocated per process. When I tested loading the
>> ispell dictionary for the Czech language, it took about 500 ms and 48 MB.
>> With a simple allocator it uses only 25 MB. If we remove some checks and the
>> tolower string transformation from the loading stage, it needs only 200 ms,
>> but then a broken dict or affix file can produce wrong results. This
>> patch significantly reduces the load on servers that use ispell
>> dictionaries.
>>
>> I know that Tom worries about the use of shared memory. I think the worry is
>> unnecessary here: after loading, the dictionary data is only read, never
>> modified. A second idea: this dictionary template could be distributed as
>> a separate project (it needs a few changes in core, plus the simple
>> allocator).
>
> Fixed-size shared memory blocks are always problematic. Would it be
> possible to do the preloading with shared_preload_libraries somehow?

Maybe. But there are some disadvantages: a) you have to copy the
dictionary info into the config, b) on some systems the large amount of
memory per process can be a problem (probably not on Linux). And you still
need some bridge between the tsearch cache and the preloaded data.

Pavel



Re: WIP: shared ispell dictionary

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> I know that Tom worries about the use of shared memory.

You're right, and if I have any say in the matter no patch like this
will ever go in.

What I would suggest looking into is some way of preprocessing the raw
text dictionary file into a format that can be slurped into memory
quickly.  The main problem compared to the way things are done now
is that the current internal format relies heavily on pointers.
Maybe you could replace those by offsets?
        regards, tom lane
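
To make the pointers-versus-offsets point concrete, here is a minimal sketch of an offset-based structure: a small binary search tree of words stored in one flat buffer, where every link is a byte offset from the start of the buffer. The names FlatImage, FlatNode, flat_insert and flat_contains are hypothetical, and the tree is far simpler than the real ispell structures. The point is only that such an image can be copied, or written to a file and read back at a different address, and then used without fixing up any pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Offset 0 holds the root slot, so 0 can double as "no node". */
#define NIL 0u

/* Hypothetical node layout: links are byte offsets from the image base. */
typedef struct FlatNode
{
    uint32_t    left;           /* offset of left child, or NIL */
    uint32_t    right;          /* offset of right child, or NIL */
    uint32_t    word;           /* offset of the NUL-terminated key */
} FlatNode;

typedef struct FlatImage
{
    unsigned char *base;        /* start of the buffer (4-byte aligned) */
    uint32_t    used;           /* bytes in use; this is what would be saved */
    uint32_t    cap;            /* total buffer size */
} FlatImage;

/* Reserve n bytes (rounded up to 4); returns the offset, or NIL if full. */
static uint32_t
flat_alloc(FlatImage *img, uint32_t n)
{
    n = (n + 3u) & ~3u;
    if (n > img->cap - img->used)
        return NIL;

    uint32_t    off = img->used;

    img->used += n;
    return off;
}

static void
flat_init(FlatImage *img, void *buf, uint32_t cap)
{
    assert(cap >= sizeof(uint32_t));
    img->base = buf;
    img->used = 0;
    img->cap = cap;
    memset(buf, 0, cap);
    /* Reserve the root slot at offset 0; it starts out as NIL. */
    (void) flat_alloc(img, (uint32_t) sizeof(uint32_t));
}

/* Insert a key; returns 1 on success, 0 if the buffer is full. */
static int
flat_insert(FlatImage *img, const char *key)
{
    uint32_t    len = (uint32_t) strlen(key) + 1;
    uint32_t    word = flat_alloc(img, len);
    uint32_t    node = flat_alloc(img, (uint32_t) sizeof(FlatNode));

    if (word == NIL || node == NIL)
        return 0;

    memcpy(img->base + word, key, len);

    FlatNode   *n = (FlatNode *) (img->base + node);

    n->left = NIL;
    n->right = NIL;
    n->word = word;

    /* Walk from the root slot to the empty link where the node belongs. */
    uint32_t   *link = (uint32_t *) img->base;

    while (*link != NIL)
    {
        FlatNode   *cur = (FlatNode *) (img->base + *link);
        int         c = strcmp(key, (const char *) (img->base + cur->word));

        link = (c < 0) ? &cur->left : &cur->right;
    }
    *link = node;
    return 1;
}

/* Lookup needs only the base address of wherever the image now lives. */
static int
flat_contains(const unsigned char *base, const char *key)
{
    uint32_t    off = *(const uint32_t *) base;

    while (off != NIL)
    {
        const FlatNode *cur = (const FlatNode *) (base + off);
        int         c = strcmp(key, (const char *) (base + cur->word));

        if (c == 0)
            return 1;
        off = (c < 0) ? cur->left : cur->right;
    }
    return 0;
}
```

After building the image, the first img.used bytes are the whole structure. Copying those bytes into a freshly allocated buffer and calling flat_contains on the copy works unchanged, which is the property a preprocessed dictionary file would need. A real on-disk format would also have to deal with versioning and byte order, which this sketch ignores.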


Re: WIP: shared ispell dictionary

From
Pavel Stehule
Date:
2010/3/18 Tom Lane <tgl@sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> I know that Tom worries about the use of shared memory.
>
> You're right, and if I have any say in the matter no patch like this
> will ever go in.
>
> What I would suggest looking into is some way of preprocessing the raw
> text dictionary file into a format that can be slurped into memory
> quickly.  The main problem compared to the way things are done now
> is that the current internal format relies heavily on pointers.
> Maybe you could replace those by offsets?

Then you have to maintain a new application :( and it can bring a new kind of bug.

I am playing with the preload solution now, and I found a new issue.

I don't know why, but when I preload a library with a large memory footprint
like ispell, all subsequent operations are ten times slower :(

[pavel@nemesis tsearch]$ psql-dev3 postgres
Timing is on.
psql-dev3 (9.0devel)
Type "help" for help.

postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 0,611 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 0,277 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 0,266 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 0,348 ms
postgres=# select * from ts_debug('cs','Jmenuji se Pavel Stěhule a bydlím ve Skalici');
   alias   |    description    |  token  |       dictionaries        |    dictionary    |     lexemes
-----------+-------------------+---------+---------------------------+------------------+-----------------
 asciiword | Word, all ASCII   | Jmenuji | {preloaded_cspell,simple} | preloaded_cspell | {jmenovat}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | se      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Pavel   | {preloaded_cspell,simple} | preloaded_cspell | {pavel,pavla}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | Stěhule | {preloaded_cspell,simple} | preloaded_cspell | {stěhule}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | a       | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | bydlím  | {preloaded_cspell,simple} | preloaded_cspell | {bydlet,bydlit}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | ve      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Skalici | {preloaded_cspell,simple} | preloaded_cspell | {skalice}
(15 rows)

Time: 24,495 ms
postgres=# select * from ts_debug('cs','Jmenuji se Pavel Stěhule a bydlím ve Skalici');
   alias   |    description    |  token  |       dictionaries        |    dictionary    |     lexemes
-----------+-------------------+---------+---------------------------+------------------+-----------------
 asciiword | Word, all ASCII   | Jmenuji | {preloaded_cspell,simple} | preloaded_cspell | {jmenovat}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | se      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Pavel   | {preloaded_cspell,simple} | preloaded_cspell | {pavel,pavla}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | Stěhule | {preloaded_cspell,simple} | preloaded_cspell | {stěhule}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | a       | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | bydlím  | {preloaded_cspell,simple} | preloaded_cspell | {bydlet,bydlit}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | ve      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Skalici | {preloaded_cspell,simple} | preloaded_cspell | {skalice}
(15 rows)

Time: 18,426 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,700 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,465 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,603 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,901 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,642 ms

When I reduce the memory use with the simple allocator, the issue goes
away, but it is strange.

Pavel




Re: WIP: shared ispell dictionary

From
Pavel Stehule
Date:
2010/3/18 Pavel Stehule <pavel.stehule@gmail.com>:
> 2010/3/18 Tom Lane <tgl@sss.pgh.pa.us>:
>> Pavel Stehule <pavel.stehule@gmail.com> writes:
>>> I know that Tom worries about the use of shared memory.
>>
>> You're right, and if I have any say in the matter no patch like this
>> will ever go in.
>>
>> What I would suggest looking into is some way of preprocessing the raw
>> text dictionary file into a format that can be slurped into memory
>> quickly.  The main problem compared to the way things are done now
>> is that the current internal format relies heavily on pointers.
>> Maybe you could replace those by offsets?
>
> Then you have to maintain a new application :( and it can bring a new kind of bug.
>
> I am playing with the preload solution now, and I found a new issue.
>
> I don't know why, but when I preload a library with a large memory footprint
> like ispell, all subsequent operations are ten times slower :(
>

This strange issue comes from the very large memory context. When I don't
link the cached tsearch context into the working context, the issue
doesn't occur.

Datum
dpreloaddict_init(PG_FUNCTION_ARGS)
{
    if (prepd == NULL)
        return dispell_init(fcinfo);    /* use without preloading */
    else
    {
        /* return PointerGetDatum(prepd); */

        /*
         * Add the preload context to the current context -- when
         * this code is active, I see the issue.
         */
        preload_ctx->parent = CurrentMemoryContext;
        preload_ctx->nextchild = CurrentMemoryContext->firstchild;
        CurrentMemoryContext->firstchild = preload_ctx;

        return PointerGetDatum(prepd);
    }
}

Pavel



Re: WIP: shared ispell dictionary

From
Pavel Stehule
Date:
2010/3/18 Tom Lane <tgl@sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> I know that Tom worries about the use of shared memory.
>
> You're right, and if I have any say in the matter no patch like this
> will ever go in.

I wrote a second patch based on preloading. For real use it still needs a
design for its parametrisation. It works well on Linux. It is simple and
fast (with the simple allocator). I am not sure about other systems.
At a minimum it could exist as a contrib module.

Pavel

>
> What I would suggest looking into is some way of preprocessing the raw
> text dictionary file into a format that can be slurped into memory
> quickly.  The main problem compared to the way things are done now
> is that the current internal format relies heavily on pointers.
> Maybe you could replace those by offsets?
>
>                        regards, tom lane
>

Attachments