Discussion: WIP: shared ispell dictionary
Hello,

the attached patch adds the possibility of sharing an ispell dictionary between processes. The motivation is the slowness of the first tsearch query and the amount of memory allocated per process. When I tested loading the ispell dictionary (for the Czech language), it took about 500 ms and 48 MB. With a simple allocator it uses only 25 MB. If we remove some checks and the tolower string transformation from the loading stage, it needs only 200 ms, but with a broken dict or affix file it can then produce wrong results. This patch significantly reduces the load on servers that use ispell dictionaries.

I know that Tom worries about the use of shared memory. I think that is unnecessary here: after loading, the dictionary data are only read, never modified.

Second idea: this dictionary template can be distributed as a separate project (it needs a few changes in core, plus the simple allocator).

Usage:
a) set shared_data = 26MB (postgres.conf)
b) restart
c) register the dictionary with the option "share=yes"

CREATE TEXT SEARCH DICTIONARY cspell
   (template=ispell, dictfile = czech, afffile=czech, stopwords=czech, share = yes);

[pavel@nemesis src]$ psql-dev3 postgres
Timing is on.
psql-dev3 (9.0devel)
Type "help" for help.

postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
   alias   |    description    |   token   |  dictionaries   | dictionary |   lexemes
-----------+-------------------+-----------+-----------------+------------+-------------
 word      | Word, all letters | Příliš    | {cspell,simple} | cspell     | {příliš}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluťoučký | {cspell,simple} | cspell     | {žluťoučký}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | kůň       | {cspell,simple} | cspell     | {kůň}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | se        | {cspell,simple} | cspell     | {}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | napil     | {cspell,simple} | cspell     | {napít}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluté     | {cspell,simple} | cspell     | {žlutý}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | vody      | {cspell,simple} | cspell     | {voda}
(13 rows)

Time: 8,178 ms <<-- without patch 500ms

Limits and ToDo:
a) it supports only simple regular expressions
b) it doesn't solve cache reset and shared memory deallocation

Regards
Pavel Stehule
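The "simple allocator" mentioned in the message above is not shown in the thread. As a rough illustration only, an allocator of that kind is often a bump (arena) allocator: one large block is obtained once and small pieces are handed out front to back, with no per-piece header and no way to free a single piece. That fits data that is loaded once and then only read. The sketch below is an assumption about the general technique, not the patch's code; the names `Arena`, `arena_init`, `arena_alloc`, `arena_strdup` and `arena_free` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical arena: one big block, handed out front to back. */
typedef struct Arena
{
    char   *base;   /* start of the block */
    size_t  size;   /* total bytes in the block */
    size_t  used;   /* bytes handed out so far */
} Arena;

/* Obtain the block. Returns 1 on success, 0 if malloc failed. */
static int
arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/*
 * Hand out n bytes, aligned to pointer size. Returns NULL when the
 * block is exhausted. There is no per-allocation header.
 */
static void *
arena_alloc(Arena *a, size_t n)
{
    const size_t align = sizeof(void *);
    size_t       start = (a->used + align - 1) & ~(align - 1);

    assert(a->base != NULL);
    if (start > a->size || n > a->size - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* Copy a string into the arena; NULL if it does not fit. */
static char *
arena_strdup(Arena *a, const char *s)
{
    size_t  len = strlen(s) + 1;
    char   *copy = arena_alloc(a, len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Release everything at once; individual pieces cannot be freed. */
static void
arena_free(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

A loader would call `arena_init` once with the expected dictionary size, use `arena_strdup` or `arena_alloc` for every word and affix entry, and call `arena_free` only when the whole dictionary is discarded.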
Pavel Stehule wrote:
> attached patch add possibility to share ispell dictionary between
> processes. The reason for this is the slowness of first tsearch query
> and size of allocated memory per process. When I tested loading of
> ispell dictionary (for Czech language) I got about 500 ms and 48MB.
> With simple allocator it uses only 25 MB. If we remove some check and
> tolower string transformation from loading stage it needs only 200 ms.
> But with broken dict or affix file it can put wrong results. This
> patch significantly reduce load on servers that use ispell
> dictionaries.
>
> I know so Tom worries about using of share memory. I think so it
> unnecessarily. After loading data from dictionary are only read, never
> modified. Second idea - this dictionary template can be distributed as
> separate project (it needs a few changes in core - and simple
> allocator).

Fixed-size shared memory blocks are always problematic. Would it be
possible to do the preloading with shared_preload_libraries somehow?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
2010/3/18 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Pavel Stehule wrote:
>> attached patch add possibility to share ispell dictionary between
>> processes. The reason for this is the slowness of first tsearch query
>> and size of allocated memory per process. When I tested loading of
>> ispell dictionary (for Czech language) I got about 500 ms and 48MB.
>> With simple allocator it uses only 25 MB. If we remove some check and
>> tolower string transformation from loading stage it needs only 200 ms.
>> But with broken dict or affix file it can put wrong results. This
>> patch significantly reduce load on servers that use ispell
>> dictionaries.
>>
>> I know so Tom worries about using of share memory. I think so it
>> unnecessarily. After loading data from dictionary are only read, never
>> modified. Second idea - this dictionary template can be distributed as
>> separate project (it needs a few changes in core - and simple
>> allocator).
>
> Fixed-size shared memory blocks are always problematic. Would it be
> possible to do the preloading with shared_preload_libraries somehow?

Maybe. But there are some disadvantages:
a) you have to copy the dictionary info into the config,
b) on some systems the large amount of memory per process can be a
   problem (probably not on Linux).

You would still have to build some bridge between the tsearch cache and
the preloaded data.

Pavel

>
> --
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com
>
Pavel Stehule <pavel.stehule@gmail.com> writes:
> I know so Tom worries about using of share memory.

You're right, and if I have any say in the matter no patch like this
will ever go in.

What I would suggest looking into is some way of preprocessing the raw
text dictionary file into a format that can be slurped into memory
quickly. The main problem compared to the way things are done now
is that the current internal format relies heavily on pointers.
Maybe you could replace those by offsets?

			regards, tom lane
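To make the "offsets instead of pointers" idea concrete, here is a minimal sketch, not taken from the thread or from the ispell code. It stores a small binary search tree of words in one flat array of 32-bit cells, where every link is a cell index rather than an address. Because nothing inside the buffer depends on where it lives in memory, the whole buffer can be written to a file and later read back (or copied with `memcpy`) in one step and used immediately. The layout and the names `FlatDict`, `flat_init`, `flat_insert` and `flat_contains` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLAT_CELLS 1024     /* capacity of the buffer, in 32-bit cells */
#define FLAT_NIL   0u       /* cell 0 is header space, so 0 can mean "no node" */

/*
 * Hypothetical layout: cells[0] = number of cells in use,
 * cells[1] = offset of the root node. A node occupies
 * [left offset][right offset][NUL-terminated word, padded to whole cells].
 */
typedef struct FlatDict
{
    uint32_t cells[FLAT_CELLS];
} FlatDict;

static void
flat_init(FlatDict *d)
{
    memset(d, 0, sizeof *d);
    d->cells[0] = 2;        /* the two header cells are in use */
}

/* The word of a node starts two cells after the node's offset. */
static const char *
node_word(const FlatDict *d, uint32_t node)
{
    return (const char *) &d->cells[node + 2];
}

/* Returns 1 if the word is stored (or already was), 0 if the buffer is full. */
static int
flat_insert(FlatDict *d, const char *word)
{
    size_t    len = strlen(word) + 1;
    uint32_t  need = 2 + (uint32_t) ((len + 3) / 4);
    uint32_t *link = &d->cells[1];  /* cell holding the offset to follow */
    uint32_t  node;

    while (*link != FLAT_NIL)
    {
        int cmp = strcmp(word, node_word(d, *link));

        if (cmp == 0)
            return 1;
        link = &d->cells[*link + (cmp < 0 ? 0 : 1)];
    }
    if (need > FLAT_CELLS - d->cells[0])
        return 0;
    node = d->cells[0];
    d->cells[0] += need;
    d->cells[node] = FLAT_NIL;
    d->cells[node + 1] = FLAT_NIL;
    memcpy(&d->cells[node + 2], word, len);
    *link = node;           /* store an offset, never an address */
    return 1;
}

/* Returns 1 if the word is present, 0 otherwise. */
static int
flat_contains(const FlatDict *d, const char *word)
{
    uint32_t node = d->cells[1];

    while (node != FLAT_NIL)
    {
        int cmp = strcmp(word, node_word(d, node));

        if (cmp == 0)
            return 1;
        node = d->cells[node + (cmp < 0 ? 0 : 1)];
    }
    return 0;
}
```

Copying a filled `FlatDict` to a different address and calling `flat_contains` on the copy works without any fix-up pass, which is the property a preprocessed dictionary file would need. A real format would also have to deal with byte order, versioning and the affix data, none of which this sketch attempts.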
2010/3/18 Tom Lane <tgl@sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> I know so Tom worries about using of share memory.
>
> You're right, and if I have any say in the matter no patch like this
> will ever go in.
>
> What I would suggest looking into is some way of preprocessing the raw
> text dictionary file into a format that can be slurped into memory
> quickly. The main problem compared to the way things are done now
> is that the current internal format relies heavily on pointers.
> Maybe you could replace those by offsets?

Then you have to maintain a new application :( and there can be a new kind of bugs.

I am playing with the preload solution now, and I found a new issue.

I don't know why, but when I preload a library that uses a lot of memory, like ispell, all subsequent operations are ten times slower :(

[pavel@nemesis tsearch]$ psql-dev3 postgres
Timing is on.
psql-dev3 (9.0devel)
Type "help" for help.

postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 0,611 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 0,277 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 0,266 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 0,348 ms
postgres=# select * from ts_debug('cs','Jmenuji se Pavel Stěhule a bydlím ve Skalici');
   alias   |    description    |  token  |       dictionaries        |    dictionary    |     lexemes
-----------+-------------------+---------+---------------------------+------------------+-----------------
 asciiword | Word, all ASCII   | Jmenuji | {preloaded_cspell,simple} | preloaded_cspell | {jmenovat}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | se      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Pavel   | {preloaded_cspell,simple} | preloaded_cspell | {pavel,pavla}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | Stěhule | {preloaded_cspell,simple} | preloaded_cspell | {stěhule}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | a       | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | bydlím  | {preloaded_cspell,simple} | preloaded_cspell | {bydlet,bydlit}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | ve      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Skalici | {preloaded_cspell,simple} | preloaded_cspell | {skalice}
(15 rows)

Time: 24,495 ms
postgres=# select * from ts_debug('cs','Jmenuji se Pavel Stěhule a bydlím ve Skalici');
   alias   |    description    |  token  |       dictionaries        |    dictionary    |     lexemes
-----------+-------------------+---------+---------------------------+------------------+-----------------
 asciiword | Word, all ASCII   | Jmenuji | {preloaded_cspell,simple} | preloaded_cspell | {jmenovat}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | se      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Pavel   | {preloaded_cspell,simple} | preloaded_cspell | {pavel,pavla}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | Stěhule | {preloaded_cspell,simple} | preloaded_cspell | {stěhule}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | a       | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | bydlím  | {preloaded_cspell,simple} | preloaded_cspell | {bydlet,bydlit}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | ve      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Skalici | {preloaded_cspell,simple} | preloaded_cspell | {skalice}
(15 rows)

Time: 18,426 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,700 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,465 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,603 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,901 ms
postgres=# select 10;
 ?column?
----------
       10
(1 row)

Time: 12,642 ms

When I reduce the memory use with the simple allocator, the issue goes away, but it is strange.

Pavel

>
> 			regards, tom lane
>
2010/3/18 Pavel Stehule <pavel.stehule@gmail.com>:
> 2010/3/18 Tom Lane <tgl@sss.pgh.pa.us>:
>> Pavel Stehule <pavel.stehule@gmail.com> writes:
>>> I know so Tom worries about using of share memory.
>>
>> You're right, and if I have any say in the matter no patch like this
>> will ever go in.
>>
>> What I would suggest looking into is some way of preprocessing the raw
>> text dictionary file into a format that can be slurped into memory
>> quickly. The main problem compared to the way things are done now
>> is that the current internal format relies heavily on pointers.
>> Maybe you could replace those by offsets?
>
> You have to maintain a new application :( There can be a new kind of bugs.
>
> I playing with preload solution now. And I found a new issue.
>
> I don't know why, but when I preload library with large mem like
> ispell, then all next operations are ten times slower :(
>

This strange issue comes from the very large memory context. When I don't join the tsearch cache context with the working context, the issue doesn't exist.

Datum
dpreloaddict_init(PG_FUNCTION_ARGS)
{
	if (prepd == NULL)
		return dispell_init(fcinfo);	// use without preloading
	else
	{
		//return PointerGetDatum(prepd);

		/*
		 * Add preload context to current context -- when this code is
		 * active, then I have the issue
		 */
		preload_ctx->parent = CurrentMemoryContext;
		preload_ctx->nextchild = CurrentMemoryContext->firstchild;
		CurrentMemoryContext->firstchild = preload_ctx;

		return PointerGetDatum(prepd);
	}
}

Pavel

>
> When I reduce memory with simple allocator, then this issue is
> removed, but it is strange.
>
> Pavel
>
>
>>
>>                        regards, tom lane
>>
>
2010/3/18 Tom Lane <tgl@sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> I know so Tom worries about using of share memory.
>
> You're right, and if I have any say in the matter no patch like this
> will ever go in.

I wrote a second patch based on preloading. For real use, the parametrisation still needs to be designed. It works well on Linux. It is simple and fast (with the simple allocator). I am not sure about other systems. At a minimum, it could exist as a contrib module.

Pavel

>
> What I would suggest looking into is some way of preprocessing the raw
> text dictionary file into a format that can be slurped into memory
> quickly. The main problem compared to the way things are done now
> is that the current internal format relies heavily on pointers.
> Maybe you could replace those by offsets?
>
>                        regards, tom lane
>