full text search to_tsquery performance with ispell dictionary

From:
Stanislav Raskin
Date:
Hello everybody,

I was experimenting with the FTS feature on postgres 8.3.4 lately and encountered a weird performance issue when using a custom FTS configuration.

I use this german ispell dictionary, re-encoded to utf8:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
With the following configuration:

CREATE TEXT SEARCH CONFIGURATION public.german_de (COPY = pg_catalog.german);


CREATE TEXT SEARCH DICTIONARY german_de_ispell (

    TEMPLATE = ispell,

    DictFile = german_de_utf8,

    AffFile = german_de_utf8,

    StopWords = german_de_utf8

);


ALTER TEXT SEARCH CONFIGURATION german_de

    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,

                      word, hword, hword_part

    WITH german_de_ispell, german_stem;



So far so good. Indexing and creation of tsvectors works like a charm.
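As a quick sanity check of such a configuration (a sketch; the exact lexemes returned depend on the dictionary files in use):

```sql
-- Show which dictionary in the chain handles each token
SELECT alias, token, dictionaries
FROM ts_debug('german_de', 'Häuser');

-- Build a tsvector with the custom configuration
SELECT to_tsvector('german_de', 'Die Häuser sind groß');
```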

The problem is that if I open a new connection to the database and do something like this:

SELECT to_tsquery('german_de', 'abcd');

it takes A LOT of time for the query to complete the first time - about 1-1.5s. If I submit the same query a second, third, fourth time and so on, it takes only some 10-20ms, which is what I would expect.
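In psql, the effect is easy to see with \timing in a fresh session (the timings in the comments are the ones reported here, not universal):

```sql
\timing on
-- first call in this connection: the ispell dictionary is loaded, ~1-1.5 s
SELECT to_tsquery('german_de', 'abcd');
-- repeated call in the same connection: dictionary already cached, ~10-20 ms
SELECT to_tsquery('german_de', 'abcd');
```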

It almost seems as if the dictionary is somehow analyzed or indexed and the results cached for each connection, which seems counter-intuitive to me. After all, the dictionaries should not change that often.

Did I miss something or did I do something wrong?

I'd be thankful for any advice.

Kind Regards

--

Stanislav Raskin

livn GmbH
Campus Freudenberg
Rainer-Gruenter-Str. 21
42119 Wuppertal

+49(0)202-8 50 66 921
raskin@livn.de
http://www.livn.de

livn
local individual video news GmbH
Registergericht Wuppertal HRB 20086

Geschäftsführer:
Dr. Stefan Brües
Alexander Jacob

Re: full text search to_tsquery performance with ispell dictionary

From:
Pavel Stehule
Date:
Hello

2011/5/11 Stanislav Raskin <raskin@livn.de>:
> Hello everybody,
> I was experimenting with the FTS feature on postgres 8.3.4 lately and
> encountered a weird performance issue when using a custom FTS configuration.
> I use this german ispell dictionary, re-encoded to utf8:
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
> With the following configuration:
>
> CREATE TEXT SEARCH CONFIGURATION public.german_de (COPY =
> pg_catalog.german);
>
> CREATE TEXT SEARCH DICTIONARY german_de_ispell (
>
>     TEMPLATE = ispell,
>
>     DictFile = german_de_utf8,
>
>     AffFile = german_de_utf8,
>
>     StopWords = german_de_utf8
>
> );
>
> ALTER TEXT SEARCH CONFIGURATION german_de
>
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>
>                       word, hword, hword_part
>
>     WITH german_de_ispell, german_stem;
>
> So far so good. Indexing and creation of tsvectors works like a charm.
> The problem is, that if I open a new connection to the database and do
> something like this
> SELECT to_tsquery('german_de', 'abcd');
> it takes A LOT of time for the query to complete for the first time. About
> 1-1,5s. If I submit the same query for a second, third, fourth time and so
> on, it takes only some 10-20ms, which is what I would expect.
> It almost seems as if the dictionary is somehow analyzed or indexed and the
> results cached for each connection, which seems counter-intuitive to me.
> After all, the dictionaries should not change that often.
> Did I miss something or did I do something wrong?
> I'd be thankful for any advice.
> Kind Regards

This is expected behaviour :( - loading an ispell dictionary is very slow.

Use the german snowball stemmer instead.

You could also use some connection pooling software.
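A minimal sketch of the snowball-only route (the configuration name german_snowball is made up for illustration):

```sql
-- pg_catalog.german already maps all word types to the german_stem
-- snowball dictionary, so there is no ispell file to load on connect.
CREATE TEXT SEARCH CONFIGURATION public.german_snowball
    (COPY = pg_catalog.german);

-- The first call in a new session stays fast, since no large
-- dictionary has to be loaded.
SELECT to_tsquery('german_snowball', 'Häuser');
```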

Regards

Pavel Stehule

> --
>
> Stanislav Raskin
>
> livn GmbH
> Campus Freudenberg
> Rainer-Gruenter-Str. 21
> 42119 Wuppertal
>
> +49(0)202-8 50 66 921
> raskin@livn.de
> http://www.livn.de
>
> livn
> local individual video news GmbH
> Registergericht Wuppertal HRB 20086
>
> Geschäftsführer:
> Dr. Stefan Brües
> Alexander Jacob

Re: full text search to_tsquery performance with ispell dictionary

From:
Tom Lane
Date:
Stanislav Raskin <raskin@livn.de> writes:
> The problem is, that if I open a new connection to the database and do
> something like this
> SELECT to_tsquery('german_de', 'abcd');
> it takes A LOT of time for the query to complete for the first time. About
> 1-1,5s. If I submit the same query for a second, third, fourth time and so
> on, it takes only some 10-20ms, which is what I would expect.

> It almost seems as if the dictionary is somehow analyzed or indexed and the
> results cached for each connection, which seems counter-intuitive to me.

Yes, loading a large dictionary is known to be a fairly expensive
operation.  There's been discussions about how to make it cheaper, but
nothing's been done yet.

            regards, tom lane

Re: full text search to_tsquery performance with ispell dictionary

From:
Pavel Stehule
Date:
Hello

2011/5/11 Stanislav Raskin <raskin@livn.de>:
>
> On 11.05.11 15:45, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:
>
>>This is expected behaviour :( - loading an ispell dictionary is very slow.
>>
>>Use the german snowball stemmer instead.
>>
>>You could also use some connection pooling software.
>
>
> Thank you for the response.
> Is the dictionary german_stem supplied with postgresql a snowball stemmer?
> If yes, it sure is incredibly fast, but yields much worse results and thus
> fewer and worse matches for search queries.
>
> Using connection pooling is... difficult in my situation, to say the
> least. We currently use quite a complex pgcluster/corosync setup for
> multi-master replication, load balancing and high availability. To
> introduce connection pooling to this setup could turn out to be quite a
> big project.
>

german_stem is part of the distribution. I think the stemmer's results
are usable, because reports about slow speed are rare.

There is no Czech stemmer, so we have to use ispell. I wrote a patch
that stores the loaded dictionary in shared memory. You can find the
source code in the pgsql-hackers mailing list archive. But it isn't well
tested and it is just a prototype - it was not accepted into PostgreSQL.
You can test it. Sometimes people use the >>simple<< configuration here.
It isn't the best, but it is fast.
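For comparison, the >>simple<< configuration involves no dictionary files at all, so there is no load cost; the trade-off is that it only lowercases tokens and does no stemming (a sketch):

```sql
-- Lexemes are just the lowercased words; 'Häuser' is not
-- normalized to its stem, so morphological variants won't match.
SELECT to_tsvector('simple', 'Die Häuser sind groß');
```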

Regards

Pavel Stehule

> --
>
> Stanislav Raskin
>
>
>
>

Re: full text search to_tsquery performance with ispell dictionary

From:
Stanislav Raskin
Date:
>
>
>
>Yes, loading a large dictionary is known to be a fairly expensive
>operation.  There's been discussions about how to make it cheaper, but
>nothing's been done yet.
>
>            regards, tom lane

Hi Tom,

thanks for the quick response. Bad news for me ;(
We develop ajax-driven web apps, which rely on quick calls to data
services. Each call to a service opens a new connection. This makes the
search service, when using FTS and ispell, about 100 times slower than a
"dumb" ILIKE implementation.

Is there any hack or compromise that achieves good performance without
losing FTS ability?
I am thinking, for example, of a way to permanently keep a loaded
dictionary in memory instead of loading it for every connection. As I
wrote in response to Pavel Stehule's post, connection pooling is not
really an option.
Our front-end is strictly PHP, so I was thinking about using a single
persistent connection
(http://de.php.net/manual/en/function.pg-pconnect.php) for all calls. Is
there some sort of major disadvantage to this approach from the database
point of view?

Kind regards

--

Stanislav Raskin



Re: full text search to_tsquery performance with ispell dictionary

From:
Stanislav Raskin
Date:
On 11.05.11 15:45, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:

>This is expected behaviour :( - loading an ispell dictionary is very slow.
>
>Use the german snowball stemmer instead.
>
>You could also use some connection pooling software.


Thank you for the response.
Is the dictionary german_stem supplied with PostgreSQL a snowball stemmer?
If so, it sure is incredibly fast, but it yields much coarser results and
thus fewer and worse matches for search queries.

Using connection pooling is... difficult in my situation, to say the
least. We currently use quite a complex pgcluster/corosync setup for
multi-master replication, load balancing and high availability. To
introduce connection pooling to this setup could turn out to be quite a
big project.

--

Stanislav Raskin




Re: full text search to_tsquery performance with ispell dictionary

From:
tv@fuzzy.cz
Date:
>>
>>
>>
>>Yes, loading a large dictionary is known to be a fairly expensive
>>operation.  There's been discussions about how to make it cheaper, but
>>nothing's been done yet.
>>
>>            regards, tom lane
>
> Hi Tom,
>
> thanks for the quick response. Bad news for me ;(
> We develop ajax-driven web apps, which sort of rely on quick calls to data
> services. Each call to a service opens a new connection. This makes the
> search service, if using fts and ispell, about 100 times slower than a
> "dumb" ILIKE-implementation.
>
> Is there any way of hack or compromise to achieve good performance without
> losing fts ability?
> I am thinking, for example, of a way to permanently keep a loaded
> dictionary in memory instead of loading it for every connection. As I
> wrote in response to Pavel Stehule's post, connection pooling is not
> really an option.
> Our front-end is strictly PHP, so I was thinking about using a single
> persistent connection
> (http://de.php.net/manual/en/function.pg-pconnect.php) for all calls. Is
> there some sort of major disadvantage in this approach from the database
> point of view?
>
> Kind regards

Hi,

opening a completely new connection for each request may be a bit
expensive, so I'd recommend using some kind of connection pooling,
especially when you're doing 'small' transactions (because that's when the
overhead matters).

We had exactly the same problem and persistent connection solved it.

But it has some drawbacks too - each connection has its own copy of the
dictionary. So if the dictionary takes 30MB and you have 10 connections,
then 300MB of memory is used.

regards
Tomas


Re: full text search to_tsquery performance with ispell dictionary

From:
Pavel Stehule
Date:
2011/5/11 Stanislav Raskin <raskin@livn.de>:
> On 11.05.11 16:42, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:
>
>
>>I wrote a
>>patch that stores loaded dictionary in shared memory.
>
> Hi Pavel,
>
> very interesting. I will give it a closer look.
>
> What do you think about using ispell to create, store and index tsvectors,
> but at the same time to use the stemmer to create tsqueries? Thus, I hope
> to create a fast user experience while searching, but at the same time
> have more detailed tsvectors and more matches.
>

I don't know for sure - but there is a risk: the two different methods can
return two different results.

Regards

Pavel

> Regards
>
> --
>
> Stanislav Raskin
>>
>
>
>

Re: full text search to_tsquery performance with ispell dictionary

From:
Tom Lane
Date:
Stanislav Raskin <raskin@livn.de> writes:
> Is there any way of hack or compromise to achieve good performance without
> losing fts ability?
> I am thinking, for example, of a way to permanently keep a loaded
> dictionary in memory instead of loading it for every connection. As I
> wrote in response to Pavel Stehule's post, connection pooling is not
> really an option.

You really, really, really need to fix whatever is preventing you from
using pooling.  Opening a database connection to run one query is just
horridly inefficient.  Loading a dictionary has evidently raised it past
the threshold where you couldn't ignore it anymore, but that doesn't
mean you weren't paying through the nose before.  A postgres backend
just isn't a lightweight object --- there is all sorts of stuff it has
to cache before it's really up to speed.

            regards, tom lane

Re: full text search to_tsquery performance with ispell dictionary

From:
Stanislav Raskin
Date:
On 11.05.11 16:42, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:


>I wrote a
>patch that stores loaded dictionary in shared memory.

Hi Pavel,

very interesting. I will give it a closer look.

What do you think about using ispell to create, store and index tsvectors,
but using the stemmer to create tsqueries? That way I hope to get a fast
user experience while searching, but at the same time have more detailed
tsvectors and more matches.
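A sketch of that mixed approach (table and column names are hypothetical; note the risk that ispell and snowball may normalize the same word to different lexemes, so some matches could be lost):

```sql
-- Index side: ispell-based configuration. The dictionary load cost is
-- paid by the long-lived process that builds the tsvectors, not by the
-- short-lived search clients.
UPDATE docs SET tsv = to_tsvector('german_de', body);
CREATE INDEX docs_tsv_idx ON docs USING gin(tsv);

-- Query side: snowball-based configuration, cheap to use from a fresh
-- connection because no ispell dictionary has to be loaded.
SELECT id FROM docs WHERE tsv @@ to_tsquery('german', 'Häuser');
```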

Regards

--

Stanislav Raskin
>



Re: full text search to_tsquery performance with ispell dictionary

From:
Oleg Bartunov
Date:
On Wed, 11 May 2011, Stanislav Raskin wrote:

>>
>>
>>
>> Yes, loading a large dictionary is known to be a fairly expensive
>> operation.  There's been discussions about how to make it cheaper, but
>> nothing's been done yet.
>>
>>            regards, tom lane
>
> Hi Tom,
>
> thanks for the quick response. Bad news for me ;(
> We develop ajax-driven web apps, which sort of rely on quick calls to data
> services. Each call to a service opens a new connection. This makes the
> search service, if using fts and ispell, about 100 times slower than a
> "dumb" ILIKE-implementation.
>
> Is there any way of hack or compromise to achieve good performance without
> losing fts ability?
> I am thinking, for example, of a way to permanently keep a loaded
> dictionary in memory instead of loading it for every connection. As I
> wrote in response to Pavel Stehule's post, connection pooling is not
> really an option.
> Our front-end is strictly PHP, so I was thinking about using a single
> persistent connection
> (http://de.php.net/manual/en/function.pg-pconnect.php) for all calls. Is
> there some sort of major disadvantage in this approach from the database
> point of view?
>
> Kind regards
>
> --
>
> Stanislav Raskin
>
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: full text search to_tsquery performance with ispell dictionary

From:
Stanislav Raskin
Date:
On 11.05.11 17:04, "tv@fuzzy.cz" <tv@fuzzy.cz> wrote:

>We had exactly the same problem and persistent connection solved it.

First testing with persistent connections seems to work like a charm. I
will do some thorough testing and watch the memory load. Hopefully, I will
not trip over some sort of pitfall. Google seems to be full of people who
have problems with persistent connections.

Big thanks for your advice.



Re: full text search to_tsquery performance with ispell dictionary

From:
Stanislav Raskin
Date:

On 11.05.11 17:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

>You really, really, really need to fix whatever is preventing you from
>using pooling.  Opening a database connection to run one query is just
>horridly inefficient.


Very true. I did not mean that anything actually prevents us from using
pooling. We just have no idea how it would interfere with our production
pgcluster setup. I imagine the evaluation, testing and verification of
pooling systems in combination with our setup would be quite tedious.

Of course, I don't open a connection for each query. One is opened for
each service call. The services are designed to start their own process
for every call. Such a process - for now - needs its own connection. It
usually handles dozens of queries. Until now, we never ran into
performance problems; the time consumed by DB operations is usually
negligible compared to the rest.

First tests with a simple persistent connection setup seem to work fine
and solve the performance issue. I intend to test this setup thoroughly
and see whether I hit a pitfall.

Big kudos to you and this list. You were a great help, as always.