Thread: Questions regarding contrib/tsearch


Questions regarding contrib/tsearch

From:
"Markus Wollny"
Date:
Hello!

I have installed tsearch for several of our databases (PostgreSQL 7.2.1
on SuSE Linux 7.3, 4x PIII Xeon 550, 2 GB RAM, RAID 5 w/ 5x18 GB) and it's
really working like a charm - much more flexible than contrib/fti and
faster on inserts, too. The documentation is still a bit short on
explaining what exactly is happening - the fti mechanism was quite easy
to grasp, while the workings of tsearch seem more like a black box to me.
But as long as it works, and it works very well indeed, you won't find
me complaining :)

I have got two questions, though. First: Has somebody implemented this
with a dictionary for German already? I imagine it could be a bit
difficult, because plurals of words with diphthongs tend to transform
one vowel into an umlaut ("Haus" -> "Häuser", "Maus" -> "Mäuse" etc.),
and there's a lot of more complicated grammar that doesn't confine
itself to changing a suffix... so German morphology cannot be described
in an algorithm quite so easily. The umlaut vowels and the sharp s are
bothering me anyway, because we store them as HTML entities in the
database; therefore a search for "Fähre" ("Fähre" = ferry) must be
written as "f&auml&hre" in the search string - and if there is just
one word containing an o-umlaut ("&ouml;", i.e. "ö") in that row, a
search for "Föhre" (search string: "f&ouml&hre"; the word is German for
"pine tree") or "führe" (search string: "f&uuml&hre"; German subjunctive
of the verb "to drive") would lead to a hit, even though these words
don't appear in the indexed text at all. I can live with this tiny
inaccuracy, because we've got a website about games and 99% of our
searches are for game titles (nearly all of them are in English, and
hardly any contain an umlaut), but I'd be interested in your experiences
with this issue and how you resolved it - or whether you just ignored
it, too :)
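
Since "ä" is stored as the entity "&auml;", a word like "F&auml;hre" is
indexed as the three separate lexemes f, auml and hre, and the query has
to AND them together. A minimal sketch against the table used further
below:

community=# select count (*) from ct_com_board_message
where textindex ## 'f&auml&hre';

A row containing "F&auml;hre" plus any other word with an "&ouml;" in it
also supplies the lexemes f, ouml and hre, which is exactly where the
false hits for "Föhre" come from.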

My second question is about performance - I think I know the answer, but
I'd like to know if I'm correct. I've got a table containing
forum messages with nearly 500,000 rows; the tsearch-indexed fields can
currently contain as much as 5,000 characters per row (we plan on
allowing about 12,000 characters in the near future), and the field that
contains the txtidx data is named textindex. Now I start a search for
messages containing the words 'Dungeon' and 'Siege':

First time search for 'dungeon&siege':
community=# explain analyze select count (*) from ct_com_board_message
where textindex ## 'dungeon&siege';
NOTICE: QUERY PLAN:
Aggregate (cost=1985.40..1985.40 rows=1 width=0) (actual
time=24913.37..24913.38 rows=1 loops=1)
-> Index Scan using t_idx on ct_com_board_message (cost=0.00..1984.18
rows=487 width=0) (actual time=14.62..24899.67 rows=2647 loops=1)
Total runtime: 24913.60 msec

Second time, same search:
community=# explain analyze select count (*) from ct_com_board_message
where textindex ## 'dungeon&siege';
NOTICE: QUERY PLAN:
Aggregate (cost=1985.40..1985.40 rows=1 width=0) (actual
time=415.66..415.66 rows=1 loops=1)
-> Index Scan using t_idx on ct_com_board_message (cost=0.00..1984.18
rows=487 width=0) (actual time=0.12..407.31 rows=2647 loops=1)
Total runtime: 415.88 msec

Just out of curiosity - the number of results for this query:
community=# select count (*) from ct_com_board_message where textindex
## 'dungeon&siege';
count
-------
2647
(1 row)

Right now the database is not in full production use, as our main
website is not yet running on it - and that site causes more than
90% of our overall database traffic. This is the reason I need to get as
much performance as I can out of the searching before switching this
last remaining site to PostgreSQL, too. I suspect that the high running
time for the first call of that query is due to the database having to
access the hard disk in order to get the needed parts of the table into
memory. This would explain the acceptably low running time of the second
call - the information needed is already in memory, so there's no slow
hard-disk access involved and the query completes quite quickly. Is
this correct? If so, what can I do to keep all of the database in
memory? The machine has 2 GB of RAM, and if I dump all the databases
into one SQL script, the resulting file is about 600 MB in size. Surely
it should be possible to keep most of that in memory at all times? What
would I need to do to accomplish this? And is there a way to get even
more detailed information about query execution, such as how much time
is spent on query planning, on disk access and so on?
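
One crude way to pre-load the table into the kernel's disk cache after a
restart (a sketch only; it pins nothing in memory) is to force a full
sequential read of it:

community=# select count (*) from ct_com_board_message;

After that, the first real search runs against a warm cache instead of a
cold one.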

Regards,

    Markus Wollny

Re: Questions regarding contrib/tsearch

From:
Oleg Bartunov
Date:
On Fri, 2 Aug 2002, Markus Wollny wrote:

> Hello!
>
> I have installed tsearch for several of our databases (PostgreSQL 7.2.1
> on SuSE Linux 7.3, 4x PIII Xeon 550, 2 GB RAM, RAID 5 w/ 5x18 GB) and it's
> really working like a charm - much more flexible than contrib/fti and
> faster on inserts, too. The documentation is still a bit short on
> explaining what exactly is happening - the fti mechanism was quite easy
> to grasp, while the workings of tsearch seem more like a black box to me.
> But as long as it works, and it works very well indeed, you won't find
> me complaining :)

Good news
>
> I have got two questions, though. First: Has somebody implemented this
> with a dictionary for German already? I imagine it could be a bit
> difficult, because plurals of words with diphthongs tend to transform
> one vowel into an umlaut ("Haus" -> "Häuser", "Maus" -> "Mäuse" etc.),
> and there's a lot of more complicated grammar that doesn't confine
> itself to changing a suffix... so German morphology cannot be described
> in an algorithm quite so easily. The umlaut vowels and the sharp s are
> bothering me anyway, because we store them as HTML entities in the
> database; therefore a search for "Fähre" ("Fähre" = ferry) must be
> written as "f&auml&hre" in the search string - and if there is just
> one word containing an o-umlaut ("&ouml;", i.e. "ö") in that row, a
> search for "Föhre" (search string: "f&ouml&hre"; the word is German for
> "pine tree") or "führe" (search string: "f&uuml&hre"; German subjunctive
> of the verb "to drive") would lead to a hit, even though these words
> don't appear in the indexed text at all. I can live with this tiny
> inaccuracy, because we've got a website about games and 99% of our
> searches are for game titles (nearly all of them are in English, and
> hardly any contain an umlaut), but I'd be interested in your experiences
> with this issue and how you resolved it - or whether you just ignored
> it, too :)


Markus, OpenFTS uses contrib/tsearch as its base data type and is
(currently) much more flexible with respect to language support. In
particular, it has support for Snowball stemmers
(http://snowball.sourceforge.net/), so in principle it should work with
German. We have plans to continue our work on tsearch, but currently
we're quite busy. You may try to write a Snowball interface for tsearch
yourself.


>
> My second question is about performance - I think I know the answer, but
> I'd like to know if I'm correct. I've got a table containing
> forum messages with nearly 500,000 rows; the tsearch-indexed fields can
> currently contain as much as 5,000 characters per row (we plan on
> allowing about 12,000 characters in the near future), and the field that
> contains the txtidx data is named textindex. Now I start a search for
> messages containing the words 'Dungeon' and 'Siege':
>
> First time search for 'dungeon&siege':
> community=# explain analyze select count (*) from ct_com_board_message
> where textindex ## 'dungeon&siege';
> NOTICE: QUERY PLAN:
> Aggregate (cost=1985.40..1985.40 rows=1 width=0) (actual
> time=24913.37..24913.38 rows=1 loops=1)
> -> Index Scan using t_idx on ct_com_board_message (cost=0.00..1984.18
> rows=487 width=0) (actual time=14.62..24899.67 rows=2647 loops=1)
> Total runtime: 24913.60 msec
>
> Second time, same search:
> community=# explain analyze select count (*) from ct_com_board_message
> where textindex ## 'dungeon&siege';
> NOTICE: QUERY PLAN:
> Aggregate (cost=1985.40..1985.40 rows=1 width=0) (actual
> time=415.66..415.66 rows=1 loops=1)
> -> Index Scan using t_idx on ct_com_board_message (cost=0.00..1984.18
> rows=487 width=0) (actual time=0.12..407.31 rows=2647 loops=1)
> Total runtime: 415.88 msec
>
> Just out of curiosity - the number of results for this query:
> community=# select count (*) from ct_com_board_message where textindex
> ## 'dungeon&siege';
> count
> -------
> 2647
> (1 row)
>
> Right now the database is not in full production use, as our main
> website is not yet running on it - and that site causes more than
> 90% of our overall database traffic. This is the reason I need to get as
> much performance as I can out of the searching before switching this
> last remaining site to PostgreSQL, too. I suspect that the high running
> time for the first call of that query is due to the database having to
> access the hard disk in order to get the needed parts of the table into
> memory. This would explain the acceptably low running time of the second
> call - the information needed is already in memory, so there's no slow
> hard-disk access involved and the query completes quite quickly. Is
> this correct? If so, what can I do to keep all of the database in
> memory? The machine has 2 GB of RAM, and if I dump all the databases
> into one SQL script, the resulting file is about 600 MB in size. Surely
> it should be possible to keep most of that in memory at all times? What
> would I need to do to accomplish this? And is there a way to get even
> more detailed information about query execution, such as how much time
> is spent on query planning, on disk access and so on?
>

You're right! The more data, the bigger the index. There have been
several threads on the -hackers list about optimizing PostgreSQL
(shared memory, buffers ...). Your database should certainly fit in 2 GB.

But there are several tsearch-specific recommendations:

1. Use morphology, stemming and stop words.  This can greatly reduce
   the size of the index! The entire PostgreSQL documentation comes to
   only about 8K distinct words after stemming. I don't remember
   whether I used stop words.

2. Use OpenFTS for now. It's much more flexible and also provides
   ranking of search results. It's also possible to specify different
   dictionaries for different languages and to choose which types of
   lexemes to index (it currently recognizes 23 classes). That is a
   big win as well, because tsearch (currently) indexes everything,
   while users usually need just words plus some kinds of numbers.

   We're about to release version 0.33 of OpenFTS, which is based on
   the tsearch module. I've written "The Crash-course to OpenFTS";
   you can currently read it on our page
   (http://www.sai.msu.su/~megera/postgres/gist/).


> Regards,
>
>     Markus Wollny
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Questions regarding contrib/tsearch

From:
Tom Lane
Date:
"Markus Wollny" <Markus.Wollny@computec.de> writes:
> ... I suspect that the high running
> time for the first call of that query is due to the database having to
> access the hard disk in order to get the needed parts of the table into
> memory. This would explain the acceptably low running time of the second
> call - the information needed is already in memory, so there's no slow
> hard-disk access involved and the query completes quite quickly. Is
> this correct?

Yup, that's my interpretation as well.

> If so, what can I do to have all of the database in memory?

Buy enough RAM to hold it ;-)

If the database is being accessed heavily then it will tend to remain
swapped in; you don't have to (and really can't) do anything to tweak
the kernel-level and Postgres-level algorithms that determine this.
What you want is to ensure there's enough RAM to hold not only all the
database hotspots, but also all the other programs and working data
that the server machine will be running.

Check the actual size-on-disk of the tables and indexes you would like
to be resident.  (Do a vacuum, then look at pg_class.relpages for these
items.  See http://developer.postgresql.org/docs/postgres/diskusage.html
for more info.)
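
A minimal sketch of that check, using the table and index names from the
queries above (relpages counts 8 kB blocks, so relpages * 8 / 1024 gives
an approximate size in MB):

community=# vacuum ct_com_board_message;
community=# select relname, relpages, relpages * 8 / 1024 as mb
from pg_class where relname in ('ct_com_board_message', 't_idx');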

I would allow about 10MB of RAM per server process, plus a healthy chunk
for the kernel and other programs.

Also, it's probably best not to go overboard on shared_buffers in this
scenario.  You want the tables to stay resident in kernel disk cache,
not necessarily in Postgres shared buffers.
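
In postgresql.conf terms that might look like the sketch below. The
values are purely illustrative; in 7.2, shared_buffers is counted in
8 kB buffers and sort_mem in kB:

shared_buffers = 8192   # 8192 x 8 kB = 64 MB, kept moderate on purpose:
                        # the kernel disk cache, not Postgres, holds the tables
sort_mem = 4096         # per-sort memory, part of the ~10 MB per-process allowance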

            regards, tom lane

Re: Questions regarding contrib/tsearch

From:
Andrew Sullivan
Date:
On Fri, Aug 02, 2002 at 09:39:54AM -0400, Tom Lane wrote:
>
> If the database is being accessed heavily then it will tend to remain
> swapped in; you don't have to (and really can't) do anything to tweak
> the kernel-level and Postgres-level algorithms that determine this.
> What you want is to ensure there's enough RAM to hold not only all the
> database hotspots, but also all the other programs and working data
> that the server machine will be running.

I was wondering: is there an in-principle reason that there isn't any
mechanism for locking a table in memory, or is it just that no-one
has ever done it?

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Questions regarding contrib/tsearch

From:
Tom Lane
Date:
Andrew Sullivan <andrew@libertyrms.info> writes:
> On Fri, Aug 02, 2002 at 09:39:54AM -0400, Tom Lane wrote:
>> If the database is being accessed heavily then it will tend to remain
>> swapped in; you don't have to (and really can't) do anything to tweak
>> the kernel-level and Postgres-level algorithms that determine this.
>> What you want is to ensure there's enough RAM to hold not only all the
>> database hotspots, but also all the other programs and working data
>> that the server machine will be running.

> I was wondering: is there an in-principle reason that there isn't any
> mechanism for locking a table in memory, or is it just that no-one
> has ever done it?

Why would you?  If the table is being heavily accessed then it will stay
in memory.  If it gets dropped from memory then the memory was needed
for something else that's more heavily used at the moment.

I'll grant you that buffer management algorithms are not perfect,
but I really doubt that "lock this table in core" is an improvement
over letting the system do its thing.

            regards, tom lane

Re: Questions regarding contrib/tsearch

From:
Andrew Sullivan
Date:
On Fri, Aug 02, 2002 at 03:30:18PM -0400, Tom Lane wrote:
> > I was wondering: is there an in-principle reason that there isn't any
> > mechanism for locking a table in memory, or is it just that no-one
> > has ever done it?
>
> Why would you?  If the table is being heavily accessed then it will stay
> in memory.  If it gets dropped from memory then the memory was needed
> for something else that's more heavily used at the moment.

I tend to agree with this, but I can imagine a case where a machine
is actually being asked to do more than it should.  In that case, you
might decide that you want to preserve the performance on certain
tables, at the expense of the overall system.  I don't even know
whether that would work; it's just the standard answer I get when I
ask Oracle guys whether Oracle's memory management is so bad that
they have to lock tables in memory. ;-)

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Questions regarding contrib/tsearch

From:
Bruce Momjian
Date:
Tom Lane wrote:
> "Markus Wollny" <Markus.Wollny@computec.de> writes:
> > ... I suspect that the high running
> > time for the first call of that query is due to the database having to
> > access the hard disk in order to get the needed parts of the table into
> > memory. This would explain the acceptably low running time of the second
> > call - the information needed is already in memory, so there's no slow
> > hard-disk access involved and the query completes quite quickly. Is
> > this correct?
>
> Yup, that's my interpretation as well.

Also, is there any mention in the tsearch documentation of clustering
the index?  Does that help performance?  I know it helped with other
full-text indexes.
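
For reference, in 7.2 syntax clustering on the tsearch index would look
something like the sketch below (assuming the index type supports
CLUSTER); it rewrites the table so that rows sit in index order on disk:

community=# cluster t_idx on ct_com_board_message;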

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Questions regarding contrib/tsearch

From:
"Markus Wollny"
Date:
Hi!

> > If so, what can I do to have all of the database in memory?
>
> Buy enough RAM to hold it ;-)
>
> If the database is being accessed heavily then it will tend to remain
> swapped in; you don't have to (and really can't) do anything to tweak
> the kernel-level and Postgres-level algorithms that determine this.
> What you want is to ensure there's enough RAM to hold not only all the
> database hotspots, but also all the other programs and working data
> that the server machine will be running.
>
> Check the actual size-on-disk of the tables and indexes you would like
> to be resident.  (Do a vacuum, then look at pg_class.relpages
> for these
> items.  See
> http://developer.postgresql.org/docs/postgres/diskusage.html
> for more info.)
>
> I would allow about 10MB of RAM per server process, plus a
> healthy chunk
> for the kernel and other programs.

Is that 10 MB per process on top of the total database size plus
shared_buffers?

In my case that would be roughly 1024 MB database size + 2560 MB for
processes (256 max.) + 256 MB for shared_buffers, I think. Or did I
misunderstand you? Because in real-life operation RAM doesn't seem to
be a problem: there's hardly any swap activity, and most of the
available RAM is used by the system cache.

Regards,

    Markus