Re: patch: preload dictionary new version

From: Pavel Stehule
Subject: Re: patch: preload dictionary new version
Msg-id: AANLkTimtqGjiAmmIhoJ5ANJo7eSH4W8YA0TnBgwG52Mf@mail.gmail.com
In reply to: Re: patch: preload dictionary new version  (Pavel Stehule <pavel.stehule@gmail.com>)
List: pgsql-hackers
Hello

I found a page, http://www.genesys-e.org/jwalter//mix4win.htm, which has
a section >>Emulation of mmap/munmap<<. Could that be a solution?
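
To make the idea concrete, here is a minimal sketch of what such an
emulation layer might look like. The names anon_map() and anon_unmap()
are hypothetical (they are not from the patch or from the page above),
and only plain anonymous allocation is covered. Sharing the mapping
between backends on Windows would need more than this (for example a
named mapping or an inherited handle), which is not shown.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on strict glibc builds */
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: reserve `size` bytes of zero-filled memory. */
void *anon_map(size_t size)
{
    /* INVALID_HANDLE_VALUE means "backed by the page file", not a real file */
    HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                 (DWORD) ((unsigned long long) size >> 32),
                                 (DWORD) (size & 0xFFFFFFFFu), NULL);
    void  *p;

    if (h == NULL)
        return NULL;
    p = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
    CloseHandle(h);             /* the mapped view keeps the mapping alive */
    return p;
}

/* Hypothetical helper: release memory obtained from anon_map(). */
int anon_unmap(void *p, size_t size)
{
    (void) size;                /* Windows unmaps the whole view */
    return UnmapViewOfFile(p) ? 0 : -1;
}
#else
#include <sys/mman.h>

/* Hypothetical helper: reserve `size` bytes of zero-filled memory. */
void *anon_map(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: release memory obtained from anon_map(). */
int anon_unmap(void *p, size_t size)
{
    return munmap(p, size);
}
#endif
```

Both branches return NULL on failure, so the caller does not need to
know about MAP_FAILED.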

Regards

Pavel Stehule

2010/7/8 Pavel Stehule <pavel.stehule@gmail.com>:
> Hello
>
> 2010/7/8 Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp>:
>>
>> Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>
>>> this version has an enhanced AllocSet allocator - it can use the mmap API.
>>
>> I reviewed your patch and will report some comments. However, I don't have
>> test cases for the patch because there are no large dictionaries in the
>> default postgres installation. I'd like to ask you to supply test data
>> for the patch.
>
> you can use a Czech dictionary - please, download it from
> http://www.pgsql.cz/data/czech.tar.gz
>
> CREATE TEXT SEARCH DICTIONARY cspell
>   (template=ispell, dictfile = czech, afffile=czech, stopwords=czech);
> CREATE TEXT SEARCH CONFIGURATION cs (copy=english);
> ALTER TEXT SEARCH CONFIGURATION cs
>   ALTER MAPPING FOR word, asciiword WITH cspell, simple;
>
> postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
>    alias   |    description    |   token   |  dictionaries   | dictionary |   lexemes
> -----------+-------------------+-----------+-----------------+------------+-------------
>  word      | Word, all letters | Příliš    | {cspell,simple} | cspell     | {příliš}
>  blank     | Space symbols     |           | {}              |            |
>  word      | Word, all letters | žluťoučký | {cspell,simple} | cspell     | {žluťoučký}
>  blank     | Space symbols     |           | {}              |            |
>  word      | Word, all letters | kůň       | {cspell,simple} | cspell     | {kůň}
>  blank     | Space symbols     |           | {}              |            |
>  asciiword | Word, all ASCII   | se        | {cspell,simple} | cspell     | {}
>  blank     | Space symbols     |           | {}              |            |
>  asciiword | Word, all ASCII   | napil     | {cspell,simple} | cspell     | {napít}
>  blank     | Space symbols     |           | {}              |            |
>  word      | Word, all letters | žluté     | {cspell,simple} | cspell     | {žlutý}
>  blank     | Space symbols     |           | {}              |            |
>  asciiword | Word, all ASCII   | vody      | {cspell,simple} | cspell     | {voda}
>
>
>>
>> This patch allocates memory with non-file-based mmap() to preload text search
>> dictionary files at server start. Note that dict files are not mmap'ed
>> directly in the patch; mmap() is used for reallocatable shared memory.
>>
>> The dictionary loader is also modified a bit to use simple_alloc() instead
>> of palloc() for the long-lived cache. That reduces calls to AllocSetAlloc(),
>> which has some overhead to support pfree(). Since the cache is never
>> released, simple_alloc() seems to have better performance than palloc().
>> Note that the optimization will also work for non-preloaded dicts.
>
> it gives slightly better speed, but mainly a significant memory
> reduction - palloc allocation is expensive, because it adds 4 bytes (8
> bytes) to every allocation. That is a problem for the thousands of small
> blocks that the TSearch ispell dictionary uses. On 64-bit the overhead
> is horrible.
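
To illustrate the overhead argument above, here is a minimal sketch of
an allocator that stores no per-chunk header, so small chunks cost only
their own (pointer-aligned) size. The Arena type and all function names
are hypothetical, and the sketch uses malloc() for the backing block;
it is not the simple_alloc() from the patch. bytes_with_header() only
does the arithmetic for a header of a given size; the real header size
depends on the platform and allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical arena: one big block, chunks carved off the front. */
typedef struct Arena
{
    char   *base;               /* start of the backing block */
    size_t  used;               /* bytes handed out so far */
    size_t  size;               /* total size of the backing block */
} Arena;

int arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->used = 0;
    a->size = size;
    return a->base != NULL ? 0 : -1;
}

/* Round n up to a multiple of align (align must be a power of two). */
size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/*
 * Hand out n bytes. No header is written in front of the chunk, which
 * is why individual chunks cannot be freed or resized later.
 */
void *arena_alloc(Arena *a, size_t n)
{
    size_t start = align_up(a->used, sizeof(void *));

    if (start > a->size || n > a->size - start)
        return NULL;            /* arena exhausted */
    a->used = start + n;
    return a->base + start;
}

/* The whole arena is released at once. */
void arena_release(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->used = a->size = 0;
}

/* Bytes needed for `count` chunks of n bytes if each carries a header. */
size_t bytes_with_header(size_t count, size_t n, size_t header)
{
    return count * (n + header);
}
```

With these definitions, 1000 chunks of 8 bytes fit exactly into an
8000-byte arena, while the same chunks with an 8-byte header each would
need 16000 bytes.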
>
>>
>> === Questions ===
>> - How do backends share the dict cache? You might expect that the
>>  postmaster's catalog is inherited by backends with fork(), but we
>>  don't use fork() on Windows.
>>
>
> I thought about some variants:
> a) using shared memory - but it needs a larger shared memory
> reservation, maybe some GUC - and this variant was rejected in
> discussion.
> b) using mmap on Unix and the CreateFileMapping API on Windows - but
> that is a bit of a problem for me. I don't have development tools for
> MS Windows, and I don't understand the MS Windows platform :(
>
> Magnus, can you give me some tips?
>
> Without MS Windows we don't need to solve shared memory and can rely
> on fork alone. If we consider MS Windows too, then we have to count on
> some shared-memory-based solution. But that offers more possibilities -
> a shared dictionary can be loaded at runtime too.
>
>> - Why are SQL functions dpreloaddict_init() and dpreloaddict_lexize()
>>  defined but not used?
>
> they are used, if I remember well. They use the ispell dictionary API.
> The usage is simplified - you parametrize the preload dictionary, and
> then you use the preloaded dictionary - not some specific dictionary.
> This has advantages and one disadvantage: + very simple configuration,
> + no shared dictionary manager is necessary, - only one preloaded
> dictionary can be used.
>
>
>>
>> === Design ===
>> - You added 3 custom parameters (dict_preload.dictfile/afffile/stopwords),
>>  but I think text search configuration names are better than file names.
>>  However, it requires system catalog access but we cannot access any
>>  catalog at the moment of preloading. If config-name-based setting is
>>  difficult, we need to write docs about where we can get the dict names
>>  to be preloaded instead. (from \dFd+ ?)
>>
>
> yes - that is a valid argument - it is not possible to access this
> data at preload time. I would like to support preloading (and possibly
> sharing of session-loaded dictionaries), because it ensures a constant
> time for TSearch queries every time. Yes, some documentation and some
> enhancement of the dictionary list info could be a solution.
>
>> - Do we need to support multiple preloaded dicts? I think dict_preload.*
>>  should accept a list of items to be loaded. GUC_LIST_INPUT will be a help.
>>
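
As a rough illustration of the list-valued setting suggested above,
here is a standalone sketch of splitting a comma-separated value into
trimmed items. split_list() is a hypothetical helper written for this
message; it does not use the server's own list-parsing routines and
does not handle quoting.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Split a comma-separated list in place. Pointers to at most `max`
 * trimmed, non-empty items are stored in `items`; the return value is
 * the number of items stored. The input buffer is modified.
 */
size_t split_list(char *input, char **items, size_t max)
{
    size_t  n = 0;
    char   *p = input;

    while (p != NULL && n < max)
    {
        char *comma = strchr(p, ',');
        char *end;

        if (comma != NULL)
            *comma = '\0';      /* terminate the current item */

        /* trim leading and trailing whitespace */
        while (isspace((unsigned char) *p))
            p++;
        end = p + strlen(p);
        while (end > p && isspace((unsigned char) end[-1]))
            end--;
        *end = '\0';

        if (*p != '\0')         /* skip empty items such as ",," */
            items[n++] = p;

        p = (comma != NULL) ? comma + 1 : NULL;
    }
    return n;
}
```

Each returned item could then be treated as the name of one dictionary
to preload.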
>
> maybe yes. Personally I would rather not complicate the design and
> usage, and I don't know of any request for multiple preloaded dicts
> now. The preloaded dictionaries interface is only a server-side matter,
> so it can be changed/enhanced later without problems. I have an idea
> about enhancing the GUC parser to allow something like
>
> preload_dictionary.patch = ...
> preload_dictionary.czech = (template=ispell, dictfile = czech,
> afffile=czech, stopwords=czech)
> preload_dictionary.japan = (template=.....
>
>
>> - Server doesn't start when I added dict_preload to
>>  shared_preload_libraries and didn't add any custom parameters.
>>    FATAL:  missing AffFile parameter
>>  But the server should start with no effect, or print a WARNING message
>>  such as "no dicts are preloaded", in such a case.
>>
>> - We could replace simple_alloc() with a new MemoryContextMethods that
>>  doesn't support pfree() but has better performance. It doesn't look
>>  ideal to me to implement simple_alloc() on top of palloc().
>>
>
> I don't agree. The palloc API is designed to be general - so I
> implemented a new memory context type via MMapAllocSetContextCreate,
> and then I use the palloc function. There is no reason to design some
> new API.
>
>> === Implementation ===
>> I'm sure that your patch is WIP, but I'll note some issues just in case.
>>
>> - We need Makefile for contrib/dict_preload.
>
> sure, sorry
>
>>
>> - mmap() is not always portable. We should check the availability
>>  in configure, and also have an alternative implementation for Win32.
>
> yes, that has to be the first step. I need an established API for
> simple allocation. Maybe divide this patch into two independent patches
> and solve memory allocation first? Dictionary preloading isn't a
> complex or large feature, so it can be handled in any commitfest.
> Memory management is more important, and can be handled first.
>
>>
>>
>> Regards,
>> ---
>> Takahiro Itagaki
>> NTT Open Source Software Center
>>
>
> Thank you very much for the review
>
> Pavel Stehule
>
>>
>>
>

