Discussion: Questions regarding contrib/tsearch
Hello!

I have installed tsearch for several of our databases (PostgreSQL 7.2.1 on SuSE Linux 7.3, 4x PIII Xeon 550, 2 GB RAM, RAID5 with 5x18 GB) and it's really working like a charm - much more flexible than contrib/fti and faster on inserts, too. The documentation is still a bit short on explaining what exactly is happening - the fti mechanism was quite easy to grasp, while the workings of tsearch seem more like a black box to me. But as long as it works, and it does work very well indeed, you won't find me complaining :)

I have got two questions, though. First: has somebody already implemented this with a dictionary for German? I imagine that it could be a bit difficult, because the plural of words containing diphthongs tends to transform one vowel into an umlaut ("Haus" -> "Häuser", "Maus" -> "Mäuse" etc.), and there's a lot of more complicated grammar which doesn't confine itself to changing a suffix... So German morphology cannot be described in an algorithm quite as easily. The umlaut vowels and the sharp s are a nuisance for us anyway, because we store them in HTML transcription in the database; therefore a search for "Fähre" (German for "ferry") must be written as "f&auml;hre" in the search string - and since "&" is the AND operator, if there is just one word containing an o-umlaut (ö, stored as &ouml;) in a row, a search for "Föhre" (search string "f&ouml;hre"; the word is German for "pine tree") or "führe" (search string "f&uuml;hre"; German subjunctive of the verb "to drive") would produce a hit, even though these words don't appear in the indexed text at all. I can live with this small inaccuracy, because we run a website about games and 99% of our searches are for game titles (nearly all of them in English, hardly any containing an umlaut), but I'd be interested in your experiences with this issue and how you resolved it - or whether you just ignored it, too :)

My second question is about performance - I think I know the answer, but I'd like to know whether I'm correct. I've got a table containing forum messages with nearly 500,000 rows; the tsearch-indexed fields can currently contain as much as 5,000 characters per row (we plan on allowing about 12,000 characters in the near future), and the field containing the txtidx data is named textindex. Now I search for messages containing the words 'Dungeon' and 'Siege'.

First time searching for 'dungeon&siege':

community=# explain analyze select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
NOTICE:  QUERY PLAN:

Aggregate  (cost=1985.40..1985.40 rows=1 width=0) (actual time=24913.37..24913.38 rows=1 loops=1)
  ->  Index Scan using t_idx on ct_com_board_message  (cost=0.00..1984.18 rows=487 width=0) (actual time=14.62..24899.67 rows=2647 loops=1)
Total runtime: 24913.60 msec

Second time, same search:

community=# explain analyze select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
NOTICE:  QUERY PLAN:

Aggregate  (cost=1985.40..1985.40 rows=1 width=0) (actual time=415.66..415.66 rows=1 loops=1)
  ->  Index Scan using t_idx on ct_com_board_message  (cost=0.00..1984.18 rows=487 width=0) (actual time=0.12..407.31 rows=2647 loops=1)
Total runtime: 415.88 msec

Just out of curiosity - the number of results of this query:

community=# select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
 count
-------
  2647
(1 row)

Right now the database is not in full production use, as our main website is not yet running on it - and that site causes more than 90% of our overall database traffic.
This is the reason I need to get as much performance as I can out of searching before switching this last remaining site to PostgreSQL, too. I suspect that the high running time of the first call of that query is due to the database having to access the hard disk in order to read the needed parts of the table into memory. This would explain the acceptably low running time of the second call - the needed information is already in memory, so no slow disk access is involved and the query completes quite quickly. Is this correct? If so, what can I do to keep all of the database in memory? The machine has 2 GB of RAM, and if I dump all the databases into one SQL script, the resulting file is about 600 MB in size. Surely it should be possible to keep most of that in memory at all times? What would I need to do to accomplish this? And is there a way to get even more detailed information about query execution, such as how much time is spent on planning, on disk access and so on?

Regards,

Markus Wollny
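The false-hit mechanics Markus describes can be seen directly: '&' is the AND operator in a txtidx query, so an HTML-transcribed search string decomposes into several independent lexemes rather than one word. A sketch against the table from this thread (exactly how tsearch's parser splits the remainder, e.g. whether the ';' is simply dropped, is an assumption here):

  -- "Föhre" is stored in its HTML transcription 'f&ouml;hre', so this
  -- query effectively demands the separate lexemes 'f', 'ouml' and 'hre'
  -- somewhere in the same row - any row containing any o-umlaut word
  -- plus those fragments matches, not just rows containing "Föhre".
  select count(*)
    from ct_com_board_message
   where textindex ## 'f&ouml;hre';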
On Fri, 2 Aug 2002, Markus Wollny wrote:

> Hello!
>
> I have installed tsearch for several of our databases (PostgreSQL 7.2.1
> on SuSE Linux 7.3, 4x PIII Xeon 550, 2 GB RAM, RAID5 with 5x18 GB) and
> it's really working like a charm - much more flexible than contrib/fti
> and faster on inserts, too. The documentation is still a bit short on
> explaining what exactly is happening - the fti mechanism was quite easy
> to grasp, while the workings of tsearch seem more like a black box to
> me. But as long as it works, and it does work very well indeed, you
> won't find me complaining :)

Good news!

> I have got two questions, though. First: has somebody already
> implemented this with a dictionary for German? I imagine that it could
> be a bit difficult, because the plural of words containing diphthongs
> tends to transform one vowel into an umlaut ("Haus" -> "Häuser",
> "Maus" -> "Mäuse" etc.), and there's a lot of more complicated grammar
> which doesn't confine itself to changing a suffix... So German
> morphology cannot be described in an algorithm quite as easily. The
> umlaut vowels and the sharp s are a nuisance for us anyway, because we
> store them in HTML transcription in the database; therefore a search
> for "Fähre" (German for "ferry") must be written as "f&auml;hre" in the
> search string - and since "&" is the AND operator, if there is just one
> word containing an o-umlaut (ö, stored as &ouml;) in a row, a search
> for "Föhre" (search string "f&ouml;hre"; German for "pine tree") or
> "führe" (search string "f&uuml;hre"; German subjunctive of "to drive")
> would produce a hit, even though these words don't appear in the
> indexed text at all. I can live with this small inaccuracy, because we
> run a website about games and 99% of our searches are for game titles
> (nearly all of them in English, hardly any containing an umlaut), but
> I'd be interested in your experiences with this issue and how you
> resolved it - or whether you just ignored it, too :)

Markus, OpenFTS uses contrib/tsearch as its base data type and is (currently) much more flexible with respect to language support. In particular, it has support for the Snowball stemmers (http://snowball.sourceforge.net/), so in principle it should work with German. We plan to continue our work on tsearch, but currently we're quite busy. You could try to write a Snowball interface to tsearch yourself.

> My second question is about performance - I think I know the answer,
> but I'd like to know whether I'm correct. I've got a table containing
> forum messages with nearly 500,000 rows; the tsearch-indexed fields can
> currently contain as much as 5,000 characters per row (we plan on
> allowing about 12,000 characters in the near future), and the field
> containing the txtidx data is named textindex.
> Now I search for messages containing the words 'Dungeon' and 'Siege'.
>
> First time searching for 'dungeon&siege':
>
> community=# explain analyze select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=1985.40..1985.40 rows=1 width=0) (actual time=24913.37..24913.38 rows=1 loops=1)
>   ->  Index Scan using t_idx on ct_com_board_message  (cost=0.00..1984.18 rows=487 width=0) (actual time=14.62..24899.67 rows=2647 loops=1)
> Total runtime: 24913.60 msec
>
> Second time, same search:
>
> community=# explain analyze select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=1985.40..1985.40 rows=1 width=0) (actual time=415.66..415.66 rows=1 loops=1)
>   ->  Index Scan using t_idx on ct_com_board_message  (cost=0.00..1984.18 rows=487 width=0) (actual time=0.12..407.31 rows=2647 loops=1)
> Total runtime: 415.88 msec
>
> Just out of curiosity - the number of results of this query:
>
> community=# select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
>  count
> -------
>   2647
> (1 row)
>
> Right now the database is not in full production use, as our main
> website is not yet running on it - and that site causes more than 90%
> of our overall database traffic. This is the reason I need to get as
> much performance as I can out of searching before switching this last
> remaining site to PostgreSQL, too. I suspect that the high running time
> of the first call of that query is due to the database having to access
> the hard disk in order to read the needed parts of the table into
> memory. This would explain the acceptably low running time of the
> second call - the needed information is already in memory, so no slow
> disk access is involved and the query completes quite quickly. Is this
> correct? If so, what can I do to keep all of the database in memory?
> The machine has 2 GB of RAM, and if I dump all the databases into one
> SQL script, the resulting file is about 600 MB in size. Surely it
> should be possible to keep most of that in memory at all times? What
> would I need to do to accomplish this? And is there a way to get even
> more detailed information about query execution, such as how much time
> is spent on planning, on disk access and so on?

You're right! The more data, the bigger the index. There have been several threads on the -hackers list about optimizing PostgreSQL (shared memory, buffers, ...). Your database certainly should fit in 2 GB. But there are several tsearch-specific recommendations:

1. Use morphology, stemming and stop words. This can greatly reduce the size of the index! All of the PostgreSQL documentation consists of only about 8K distinct words after stemming. I don't remember if I used stop words.

2. Use OpenFTS for now. It's much more flexible and also provides ranking of search results. It's also possible to specify different dictionaries for different languages and which types of lexemes to index (it currently recognizes 23 classes), and this is a big win too, because tsearch (currently) indexes everything! Users usually need just words plus some kinds of numbers. We're about to release version 0.33 of OpenFTS, which is based on the tsearch module. I've written "The Crash-course to OpenFTS", which you can currently read on our page (http://www.sai.msu.su/~megera/postgres/gist/).
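For readers wiring this up from scratch, the names in this thread imply roughly the setup below. This is a sketch only: txt2txtidx() and the tsearch() trigger function are recalled from the 7.2-era contrib/tsearch README and may differ, and message_text is a hypothetical source column:

  -- sketch of a contrib/tsearch setup using this thread's names;
  -- 'message_text' is a hypothetical source column
  alter table ct_com_board_message add column textindex txtidx;
  update ct_com_board_message
     set textindex = txt2txtidx(message_text);
  create index t_idx on ct_com_board_message using gist (textindex);
  -- keep the index column current on insert/update
  create trigger textindex_update
      before insert or update on ct_com_board_message
      for each row execute procedure tsearch(textindex, message_text);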
> Regards,
>
> Markus Wollny

Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
"Markus Wollny" <Markus.Wollny@computec.de> writes: > ... I suspect that the high running > time for the first call of that query is due to the database having to > do harddisk-access in order to get the needed parts of the table into > memory. This would explain the acceptably low running time of the second > call - the information needed is already in memory, so there's no slow > harddisk-access involved and the query is completed quite quickly. Is > this correct? Yup, that's my interpretation as well. > If so, what can I do to have all of the database in memory? Buy enough RAM to hold it ;-) If the database is being accessed heavily then it will tend to remain swapped in; you don't have to (and really can't) do anything to tweak the kernel-level and Postgres-level algorithms that determine this. What you want is to ensure there's enough RAM to hold not only all the database hotspots, but also all the other programs and working data that the server machine will be running. Check the actual size-on-disk of the tables and indexes you would like to be resident. (Do a vacuum, then look at pg_class.relpages for these items. See http://developer.postgresql.org/docs/postgres/diskusage.html for more info.) I would allow about 10MB of RAM per server process, plus a healthy chunk for the kernel and other programs. Also, it's probably best not to go overboard on shared_buffers in this scenario. You want the tables to stay resident in kernel disk cache, not necessarily in Postgres shared buffers. regards, tom lane
On Fri, Aug 02, 2002 at 09:39:54AM -0400, Tom Lane wrote:
>
> If the database is being accessed heavily then it will tend to remain
> swapped in; you don't have to (and really can't) do anything to tweak
> the kernel-level and Postgres-level algorithms that determine this.
> What you want is to ensure there's enough RAM to hold not only all the
> database hotspots, but also all the other programs and working data
> that the server machine will be running.

I was wondering: is there an in-principle reason that there isn't any mechanism for locking a table in memory, or is it just that no one has ever done it?

A

--
Andrew Sullivan, Liberty RMS
87 Mowat Avenue, Toronto, Ontario, Canada M6K 3E3
<andrew@libertyrms.info>
+1 416 646 3304 x110
Andrew Sullivan <andrew@libertyrms.info> writes:
> I was wondering: is there an in-principle reason that there isn't any
> mechanism for locking a table in memory, or is it just that no one has
> ever done it?

Why would you? If the table is being heavily accessed then it will stay in memory. If it gets dropped from memory, then the memory was needed for something else that is more heavily used at the moment. I'll grant you that buffer management algorithms are not perfect, but I really doubt that "lock this table in core" would be an improvement over letting the system do its thing.

			regards, tom lane
On Fri, Aug 02, 2002 at 03:30:18PM -0400, Tom Lane wrote:
> > I was wondering: is there an in-principle reason that there isn't any
> > mechanism for locking a table in memory, or is it just that no one
> > has ever done it?
>
> Why would you? If the table is being heavily accessed then it will
> stay in memory. If it gets dropped from memory, then the memory was
> needed for something else that is more heavily used at the moment.

I tend to agree with this, but I can imagine a case where a machine is actually being asked to do more than it should. In that case you might decide to preserve the performance of certain tables at the expense of the overall system. I don't even know whether that would work; it's just the standard answer I get when I ask Oracle guys whether Oracle's memory management is so bad that they have to lock tables in memory. ;-)

A

--
Andrew Sullivan, Liberty RMS
87 Mowat Avenue, Toronto, Ontario, Canada M6K 3E3
<andrew@libertyrms.info>
+1 416 646 3304 x110
Tom Lane wrote:
> "Markus Wollny" <Markus.Wollny@computec.de> writes:
> > ... I suspect that the high running time of the first call of that
> > query is due to the database having to access the hard disk in order
> > to read the needed parts of the table into memory. This would
> > explain the acceptably low running time of the second call - the
> > needed information is already in memory, so no slow disk access is
> > involved and the query completes quite quickly. Is this correct?
>
> Yup, that's my interpretation as well.

Also, is there any mention in the tsearch documentation of clustering the table on the index? Does that help performance? I know it helped with other full-text indexes.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
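What Bruce suggests would look like the sketch below. Two caveats apply, both assumptions worth verifying first: in 7.2 CLUSTER rewrites the table and silently loses other indexes and grants, and it is not certain that this version accepts a GiST index such as t_idx at all:

  -- reorder the heap to follow t_idx, so rows matching the same lexemes
  -- land on nearby pages and the cold-cache run needs fewer random reads
  cluster t_idx on ct_com_board_message;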
Hi!

> > If so, what can I do to keep all of the database in memory?
>
> Buy enough RAM to hold it ;-)
>
> I would allow about 10 MB of RAM per server process, plus a healthy
> chunk for the kernel and other programs.

Is that 10 MB per process on top of the total database size plus shared_buffers? In my case that would be roughly 1024 MB database size + 2560 MB for processes (256 max.) + 256 MB for shared_buffers, I think. Or did I misunderstand you? In real-life operation RAM doesn't seem to be a problem, though: there's hardly any swap activity, and most of the available RAM is used by the system cache.

Regards,

Markus
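For context, the settings under discussion live in postgresql.conf; here is a sketch with illustrative values for a 2 GB machine on 7.2, not tuned recommendations. shared_buffers and effective_cache_size are counted in 8 kB pages, sort_mem in kilobytes:

  # illustrative values only - tune against the actual workload
  shared_buffers = 32768          # 256 MB of Postgres shared buffers
  sort_mem = 8192                 # 8 MB per sort, per backend
  effective_cache_size = 131072   # tell the planner to expect ~1 GB of kernel cache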