Re: Google Summer of Code 2008

From	Oleg Bartunov
Subject	Re: Google Summer of Code 2008
Date	March 8, 2008, 18:29:42
Msg-id	Pine.LNX.4.64.0803082219280.10010@sn.sai.msu.ru
In reply to	Re: Google Summer of Code 2008 (Jan Urbański <j.urbanski@students.mimuw.edu.pl>)
Responses	Re: Google Summer of Code 2008 (Tom Lane <tgl@sss.pgh.pa.us>) Re: Google Summer of Code 2008 (Jan Urbański <j.urbanski@students.mimuw.edu.pl>)
List	pgsql-hackers

On Sat, 8 Mar 2008, Jan Urbański wrote:

> Oleg Bartunov wrote:
>> Jan,
>> 
>> the problem is well known and often requested. From your proposal it's not
>> clear what the idea is.
>>> Tom Lane wrote:
>>>> Jan Urbański <j.urbanski@students.mimuw.edu.pl> 
>>>> writes:
>>>>> 2. Implement better selectivity estimates for FTS.
>
> OK, after reading through some of the code, the idea is to write a custom 
> typanalyze function for tsvector columns. It could look inside the tsvectors, 
> compute the most commonly appearing lexemes and store that information in 
> pg_statistics. Then there should be a custom selectivity function for @@ and 
> friends, that would look at the lexemes in pg_statistics, see if the tsquery 
> it got matches some/any of them and return a result based on that.

Such a function already exists: ts_stat(). The problem with ts_stat() is
its performance, since it sequentially scans ALL tsvectors. It's possible to
write a special function for the tsvector data type, to be used by
analyze, but I'm not sure sampling is a good approach here.
One way to improve the performance of gathering stats with ts_stat() is
to process only new documents. That may not be as fast as it sounds if there
are a lot of updates, so it needs more thought.
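
To make the trade-off between a full scan and sampling concrete, here is a
minimal Python sketch. It is a toy model, not PostgreSQL code: documents are
plain lists of lexemes, and the function names (top_lexemes,
sampled_top_lexemes) and the sample data are hypothetical. The first function
counts document frequency over every document, roughly the work a full scan
has to do; the second does the same over a random sample, which is roughly
what an analyze-time function would see.

```python
import random
from collections import Counter


def top_lexemes(docs, k):
    """Return the k most common lexemes as (lexeme, fraction of docs) pairs."""
    counts = Counter()
    for lexemes in docs:
        # Count each lexeme at most once per document (document frequency).
        counts.update(set(lexemes))
    n = len(docs)
    return [(lexeme, count / n) for lexeme, count in counts.most_common(k)]


def sampled_top_lexemes(docs, k, sample_size, seed=0):
    """Same statistic, computed over a random sample of the documents."""
    rng = random.Random(seed)
    sample = rng.sample(docs, min(sample_size, len(docs)))
    return top_lexemes(sample, k)


# Hypothetical documents, already reduced to lexemes.
docs = [
    ["postgres", "index", "gin"],
    ["postgres", "query", "planner"],
    ["postgres", "index", "gist"],
    ["text", "search", "postgres"],
]

full = top_lexemes(docs, 2)
approx = sampled_top_lexemes(docs, 2, sample_size=2)
print(full)
print(approx)
```

The sampled result costs less to compute, but its frequencies depend on
which documents land in the sample, which is the concern about sampling
raised above.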

>
> I have a feeling that in many cases identifying the top 50 to 300 lexemes 
> would be enough to talk about text search selectivity with a degree of 
> confidence. At least we wouldn't give overly low estimates for queries 
> looking for very popular words, which I believe is worse than giving an overly 
> high estimate for an obscure query (am I wrong here?).

Unfortunately, selectivity estimation for a query is much more difficult than
just estimating the frequency of an individual word.
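
A small Python sketch of why this is harder. It is an illustration only, not
how PostgreSQL does or would do it: the query representation (nested tuples),
the function name selectivity, the default value for unknown lexemes, and the
sample data are all assumptions made for the example. It combines per-lexeme
frequencies as if lexemes occurred independently, then compares the estimate
with the true fraction on documents where two words always appear together.

```python
def selectivity(query, freqs, default=0.001):
    """Estimate the fraction of documents matching a query.

    query is either a lexeme (str) or a tuple:
    ("and", left, right), ("or", left, right) or ("not", operand).
    Lexemes missing from freqs get an assumed default frequency.
    """
    if isinstance(query, str):
        return freqs.get(query, default)
    op = query[0]
    if op == "not":
        return 1.0 - selectivity(query[1], freqs, default)
    left = selectivity(query[1], freqs, default)
    right = selectivity(query[2], freqs, default)
    if op == "and":
        return left * right  # assumes independence
    if op == "or":
        return left + right - left * right  # assumes independence
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical documents in which "new" and "york" always co-occur.
docs = [{"new", "york"}, {"new", "york"}, {"paris"}, {"london"}]
freqs = {"new": 0.5, "york": 0.5}

estimated = selectivity(("and", "new", "york"), freqs)
actual = sum(1 for d in docs if {"new", "york"} <= d) / len(docs)
print(estimated, actual)
```

Each word's frequency is known exactly here, yet the estimate for the
conjunction is half the true value, because the words are correlated.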

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
