Обсуждение: FTS performance with the Polish config
Hello I just finished implementing a "search engine" for my site and found ts_headline extremely slow when used with a Polish tsearch configuration, while fast with English. All of it boils down to a simple testcase, but first some background. I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 (2.6.21), then switched both installations to 8.3.8 (both packages compiled, but provided by the distro - port/emerge). The Polish dictionaries and config were created according to this article (it's in Polish, but the code is self-explanatory): http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ Now for the testcase: text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.' # explain analyze select ts_headline('polish', text, plainto_tsquery('polish', 'foobar')); QUERY PLAN ------------------------------------------------------------------------------------ Result (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470 rows=1 loops=1) Total runtime: 6.524 ms (2 rows) # explain analyze select ts_headline('english', text, plainto_tsquery('english', 'foobar')); QUERY PLAN ------------------------------------------------------------------------------------ Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895 rows=1 loops=1) Total runtime: 0.935 ms (2 rows) # explain analyze select ts_headline('simple', text, plainto_tsquery('simple', 'foobar')); QUERY PLAN ------------------------------------------------------------------------------------ Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660 rows=1 loops=1) Total runtime: 0.697 ms (2 rows) # As you can see, the results differ by an order of magnitude between Polish and English. While in this simple testcase it's a non-issue, in the real world this translates into enormous overhead. One of the queries I ran testing my site's search function took 1870ms. When I took that query and changed all ts_headline(foo) calls to just foo, the time dropped below 100ms. That's the difference between something completely unacceptable and something quite useful. I can post various details about the hardware, software and specific queries, but the testcases speak for themselves. I'm sure you can easily reproduce my results. Hints would be very much appreciated, since I've already spent way more time on this, than I could afford. cheers, Wojciech Knapik PS. A few other details can be found here http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with snippets of my conversations in #postgresql that lead to this testcase. Big thanks to RhodiumToad for helping me with fts for the last couple days ;]
On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote: > > Hello > > I just finished implementing a "search engine" for my site and found > ts_headline extremely slow when used with a Polish tsearch configuration, > while fast with English. All of it boils down to a simple testcase, but > first some background. > > I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 (2.6.21), > then switched both installations to 8.3.8 (both packages compiled, but > provided by the distro - port/emerge). The Polish dictionaries and config > were created according to this article (it's in Polish, but the code is > self-explanatory): > > http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ > > Now for the testcase: > > text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do > eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad > minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex > ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate > velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat > cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id > est laborum.' > > # explain analyze select ts_headline('polish', text, > plainto_tsquery('polish', 'foobar')); > QUERY PLAN > ------------------------------------------------------------------------------------ > Result (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470 rows=1 > loops=1) > Total runtime: 6.524 ms > (2 rows) > > # explain analyze select ts_headline('english', text, > plainto_tsquery('english', 'foobar')); > QUERY PLAN > ------------------------------------------------------------------------------------ > Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895 rows=1 > loops=1) > Total runtime: 0.935 ms > (2 rows) > > # explain analyze select ts_headline('simple', text, > plainto_tsquery('simple', 'foobar')); > QUERY PLAN > ------------------------------------------------------------------------------------ > Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660 rows=1 > loops=1) > Total runtime: 0.697 ms > (2 rows) > > # > > As you can see, the results differ by an order of magnitude between Polish > and English. While in this simple testcase it's a non-issue, in the real > world this translates into enormous overhead. > > One of the queries I ran testing my site's search function took 1870ms. > When I took that query and changed all ts_headline(foo) calls to just foo, > the time dropped below 100ms. That's the difference between something > completely unacceptable and something quite useful. > > I can post various details about the hardware, software and specific > queries, but the testcases speak for themselves. I'm sure you can easily > reproduce my results. > > Hints would be very much appreciated, since I've already spent way more > time on this, than I could afford. > > > cheers, > Wojciech Knapik > > > PS. A few other details can be found here > http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with > snippets of my conversations in #postgresql that lead to this testcase. Big > thanks to RhodiumToad for helping me with fts for the last couple days ;] > Hi, The documentation for ts_headline() states: ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care. A typical mistake is to call ts_headline for every matching document when only ten documents are to be shown. SQL subqueries can help; here is an example: SELECT id, ts_headline(body, q), rank FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank FROM apod, to_tsquery('stars') q WHERE ti @@ q ORDER BY rank DESC LIMIT 10) AS foo; It looks like you have proven that behavior. I have not looked at the ts_headline code, but it may also be slowed by the locale, so showing that it is faster for English is not really saying much. Maybe there is a better algorithm that could be used, but that would require code changes. It may be that you can change some of the parameters to speed it up. Good luck. Regards, Ken
Kenneth Marshall <ktm@rice.edu> writes: > On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote: >> I just finished implementing a "search engine" for my site and found >> ts_headline extremely slow when used with a Polish tsearch configuration, >> while fast with English. > The documentation for ts_headline() states: > ts_headline uses the original document, not a tsvector summary, so it > can be slow and should be used with care. That's true but the argument in the docs would apply just as well to english or any other config. So while Wojciech would be well advised to try to avoid making a lot of calls to ts_headline, it's still curious that it's so much slower in polish than english. Could we see a self-contained test case? regards, tom lane
2009/11/14 Tom Lane <tgl@sss.pgh.pa.us>: > Kenneth Marshall <ktm@rice.edu> writes: >> On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote: >>> I just finished implementing a "search engine" for my site and found >>> ts_headline extremely slow when used with a Polish tsearch configuration, >>> while fast with English. > >> The documentation for ts_headline() states: >> ts_headline uses the original document, not a tsvector summary, so it >> can be slow and should be used with care. > > That's true but the argument in the docs would apply just as well to > english or any other config. So while Wojciech would be well advised > to try to avoid making a lot of calls to ts_headline, it's still curious > that it's so much slower in polish than english. Could we see a > self-contained test case? is it dictionary based or stem based? Dictionary based FTS is very slow (first load). Minimally czech FTS is slow. regards Pavel Stehule > > regards, tom lane > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
Yes, as stated original author use polish ispell dictionary. Ispell dictionary is slow to load first time. In real life it should be no problem. Oleg On Sat, 14 Nov 2009, Pavel Stehule wrote: > 2009/11/14 Tom Lane <tgl@sss.pgh.pa.us>: > > Kenneth Marshall <ktm@rice.edu> writes: > >> On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote: > >>> I just finished implementing a "search engine" for my site and found > >>> ts_headline extremely slow when used with a Polish tsearch configuratio= > n, > >>> while fast with English. > > > >> The documentation for ts_headline() states: > >> ts_headline uses the original document, not a tsvector summary, so it > >> can be slow and should be used with care. > > > > That's true but the argument in the docs would apply just as well to > > english or any other config. =C2=A0So while Wojciech would be well advised > > to try to avoid making a lot of calls to ts_headline, it's still curious > > that it's so much slower in polish than english. =C2=A0Could we see a > > self-contained test case? > > is it dictionary based or stem based? > > Dictionary based FTS is very slow (first load). Minimally czech FTS is slow. > > regards > Pavel Stehule > > > > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= > =A0 =C2=A0regards, tom lane > > > > -- > > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > > To make changes to your subscription: > > http://www.postgresql.org/mailpref/pgsql-performance > > > > --=20 > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
2009/11/15 Oleg Bartunov <oleg@sai.msu.su>: > Yes, as stated original author use polish ispell dictionary. > Ispell dictionary is slow to load first time. In real life it should be no > problem. > it is a problem. People who needs fast access uses english without czech. It drop some features, but it is significaly faster. Pavel > Oleg > On Sat, 14 Nov 2009, Pavel Stehule wrote: > >> 2009/11/14 Tom Lane <tgl@sss.pgh.pa.us>: >> > Kenneth Marshall <ktm@rice.edu> writes: >> >> On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote: >> >>> I just finished implementing a "search engine" for my site and found >> >>> ts_headline extremely slow when used with a Polish tsearch >> >>> configuratio= >> n, >> >>> while fast with English. >> > >> >> The documentation for ts_headline() states: >> >> ts_headline uses the original document, not a tsvector summary, so it >> >> can be slow and should be used with care. >> > >> > That's true but the argument in the docs would apply just as well to >> > english or any other config. =C2=A0So while Wojciech would be well >> > advised >> > to try to avoid making a lot of calls to ts_headline, it's still curious >> > that it's so much slower in polish than english. =C2=A0Could we see a >> > self-contained test case? >> >> is it dictionary based or stem based? >> >> Dictionary based FTS is very slow (first load). Minimally czech FTS is >> slow. >> >> regards >> Pavel Stehule >> >> > >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 >> > =C2= >> =A0 =C2=A0regards, tom lane >> > >> > -- >> > Sent via pgsql-performance mailing list >> > (pgsql-performance@postgresql.org) >> > To make changes to your subscription: >> > http://www.postgresql.org/mailpref/pgsql-performance >> > >> >> --=20 >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >> > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 >
On Sun, 15 Nov 2009, Pavel Stehule wrote: > 2009/11/15 Oleg Bartunov <oleg@sai.msu.su>: >> Yes, as stated original author use polish ispell dictionary. >> Ispell dictionary is slow to load first time. In real life it should be no >> problem. >> > > it is a problem. People who needs fast access uses english without > czech. It drop some features, but it is significaly faster. just don't use ispell dictionary, czech snowball stemmer is as fast as english. Ispell dictionary (doesn't matter english, or other language) is slow for the first load and then it caches, so there is no problem if use persistent database connection, which is de facto standard for any serious projects. > > Pavel > >> Oleg >> On Sat, 14 Nov 2009, Pavel Stehule wrote: >> >>> 2009/11/14 Tom Lane <tgl@sss.pgh.pa.us>: >>>> Kenneth Marshall <ktm@rice.edu> writes: >>>>> On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote: >>>>>> I just finished implementing a "search engine" for my site and found >>>>>> ts_headline extremely slow when used with a Polish tsearch >>>>>> configuratio= >>> n, >>>>>> while fast with English. >>>> >>>>> The documentation for ts_headline() states: >>>>> ts_headline uses the original document, not a tsvector summary, so it >>>>> can be slow and should be used with care. >>>> >>>> That's true but the argument in the docs would apply just as well to >>>> english or any other config. =C2=A0So while Wojciech would be well >>>> advised >>>> to try to avoid making a lot of calls to ts_headline, it's still curious >>>> that it's so much slower in polish than english. =C2=A0Could we see a >>>> self-contained test case? >>> >>> is it dictionary based or stem based? >>> >>> Dictionary based FTS is very slow (first load). Minimally czech FTS is >>> slow. >>> >>> regards >>> Pavel Stehule >>> >>>> >>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 >>>> =C2= >>> =A0 =C2=A0regards, tom lane >>>> >>>> -- >>>> Sent via pgsql-performance mailing list >>>> (pgsql-performance@postgresql.org) >>>> To make changes to your subscription: >>>> http://www.postgresql.org/mailpref/pgsql-performance >>>> >>> >>> --=20 >>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >>> To make changes to your subscription: >>> http://www.postgresql.org/mailpref/pgsql-performance >>> >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 >> > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
2009/11/15 Oleg Bartunov <oleg@sai.msu.su>: > On Sun, 15 Nov 2009, Pavel Stehule wrote: > >> 2009/11/15 Oleg Bartunov <oleg@sai.msu.su>: >>> >>> Yes, as stated original author use polish ispell dictionary. >>> Ispell dictionary is slow to load first time. In real life it should be >>> no >>> problem. >>> >> >> it is a problem. People who needs fast access uses english without >> czech. It drop some features, but it is significaly faster. > > just don't use ispell dictionary, czech snowball stemmer is as fast as > english. czech stemmer doesn't exist :( > > Ispell dictionary (doesn't matter english, or other language) is slow for > the first load and then it caches, so there is no problem if use persistent > database connection, which is de facto standard for any serious projects. > I agree so connection pooling should be a solution. But it is good? Cannot we share dictionary better? >> >> Pavel >> >>> Oleg >>> On Sat, 14 Nov 2009, Pavel Stehule wrote: >>> >>>> 2009/11/14 Tom Lane <tgl@sss.pgh.pa.us>: >>>>> >>>>> Kenneth Marshall <ktm@rice.edu> writes: >>>>>> >>>>>> On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote: >>>>>>> >>>>>>> I just finished implementing a "search engine" for my site and found >>>>>>> ts_headline extremely slow when used with a Polish tsearch >>>>>>> configuratio= >>>> >>>> n, >>>>>>> >>>>>>> while fast with English. >>>>> >>>>>> The documentation for ts_headline() states: >>>>>> ts_headline uses the original document, not a tsvector summary, so it >>>>>> can be slow and should be used with care. >>>>> >>>>> That's true but the argument in the docs would apply just as well to >>>>> english or any other config. =C2=A0So while Wojciech would be well >>>>> advised >>>>> to try to avoid making a lot of calls to ts_headline, it's still >>>>> curious >>>>> that it's so much slower in polish than english. =C2=A0Could we see a >>>>> self-contained test case? >>>> >>>> is it dictionary based or stem based? >>>> >>>> Dictionary based FTS is very slow (first load). Minimally czech FTS is >>>> slow. >>>> >>>> regards >>>> Pavel Stehule >>>> >>>>> >>>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 >>>>> =C2= >>>> >>>> =A0 =C2=A0regards, tom lane >>>>> >>>>> -- >>>>> Sent via pgsql-performance mailing list >>>>> (pgsql-performance@postgresql.org) >>>>> To make changes to your subscription: >>>>> http://www.postgresql.org/mailpref/pgsql-performance >>>>> >>>> >>>> --=20 >>>> Sent via pgsql-performance mailing list >>>> (pgsql-performance@postgresql.org) >>>> To make changes to your subscription: >>>> http://www.postgresql.org/mailpref/pgsql-performance >>>> >>> >>> Regards, >>> Oleg >>> _____________________________________________________________ >>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >>> Sternberg Astronomical Institute, Moscow University, Russia >>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >>> phone: +007(495)939-16-83, +007(495)939-23-83 >>> >> > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83
On Sun, 15 Nov 2009, Pavel Stehule wrote: > > czech stemmer doesn't exist :( > I'd try morfessor http://www.cis.hut.fi/projects/morpho/, which is unsupervised morphological dictionary. I think it'd be not very hard to add morfessor dictionary template to tsearch2, so people could create their own stemmers. >> >> Ispell dictionary (doesn't matter english, or other language) is slow for >> the first load and then it caches, so there is no problem if use persistent >> database connection, which is de facto standard for any serious projects. >> > > I agree so connection pooling should be a solution. But it is good? > Cannot we share dictionary better? We thought about this issue and got some idea. Teodor can be more clear here, since I don't remember all details. > >>> >>> Pavel >>> >>>> Oleg >>>> On Sat, 14 Nov 2009, Pavel Stehule wrote: >>>> >>>>> 2009/11/14 Tom Lane <tgl@sss.pgh.pa.us>: >>>>>> >>>>>> Kenneth Marshall <ktm@rice.edu> writes: >>>>>>> >>>>>>> On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote: >>>>>>>> >>>>>>>> I just finished implementing a "search engine" for my site and found >>>>>>>> ts_headline extremely slow when used with a Polish tsearch >>>>>>>> configuratio= >>>>> >>>>> n, >>>>>>>> >>>>>>>> while fast with English. >>>>>> >>>>>>> The documentation for ts_headline() states: >>>>>>> ts_headline uses the original document, not a tsvector summary, so it >>>>>>> can be slow and should be used with care. >>>>>> >>>>>> That's true but the argument in the docs would apply just as well to >>>>>> english or any other config. =C2=A0So while Wojciech would be well >>>>>> advised >>>>>> to try to avoid making a lot of calls to ts_headline, it's still >>>>>> curious >>>>>> that it's so much slower in polish than english. =C2=A0Could we see a >>>>>> self-contained test case? >>>>> >>>>> is it dictionary based or stem based? >>>>> >>>>> Dictionary based FTS is very slow (first load). Minimally czech FTS is >>>>> slow. >>>>> >>>>> regards >>>>> Pavel Stehule >>>>> >>>>>> >>>>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 >>>>>> =C2= >>>>> >>>>> =A0 =C2=A0regards, tom lane >>>>>> >>>>>> -- >>>>>> Sent via pgsql-performance mailing list >>>>>> (pgsql-performance@postgresql.org) >>>>>> To make changes to your subscription: >>>>>> http://www.postgresql.org/mailpref/pgsql-performance >>>>>> >>>>> >>>>> --=20 >>>>> Sent via pgsql-performance mailing list >>>>> (pgsql-performance@postgresql.org) >>>>> To make changes to your subscription: >>>>> http://www.postgresql.org/mailpref/pgsql-performance >>>>> >>>> >>>> Regards, >>>> Oleg >>>> _____________________________________________________________ >>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >>>> Sternberg Astronomical Institute, Moscow University, Russia >>>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >>>> phone: +007(495)939-16-83, +007(495)939-23-83 >>>> >>> >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83