Re: Google Summer of Code 2008

Поиск

Список

Период

Сортировка

От	Jan Urbański
Тема	Re: Google Summer of Code 2008
Дата	4 марта 2008 г. 14:45:54
Msg-id	47CD9758.4070403@students.mimuw.edu.pl обсуждение исходный текст
Ответ на	Re: Google Summer of Code 2008 (Oleg Bartunov <oleg@sai.msu.su>)
Список	pgsql-hackers

Дерево обсуждения

Oleg Bartunov wrote:
> Jan,
>
> the problem is known and well requested. From your promotion it's not
> clear what's an idea ?

I guess the first approach could be to populate some more columns in
pg_statistics for tables with tsvectors. I see there are some statistics
already being gathered (pg_stat's histogram_bounds are populated for
tsvector columns), so maybe one could use that?
Even remembering a few of the most frequently appearing lexemes could in
my opinion help. I plotted distinct lexemes against the number documents
containing them (basically the output of stat()) in one of our databases
and came out with this:
http://www.fiok.pl/~jurbanski/kaired-depesze.png
The three high values are really stopwords, and partially because of
that I wrote my first FTS patch, but this shows that if we'd remember
the ~40 most frequent lexemes, we could give much better estimates for
popular queries (and I think are the ones that hurt performance most are
those which underestimate the row count).

As for a more general solution I'd have to read deeper into the tsearch
code to understand how the tsvector type and @@ operator work and give
it a bit more thought. I'm planning to do that in the next three weeks
(read: before the student applications period starts). Maybe some kind
of heuristic could be implemented? Possibly someone could load some
information specific to her language, which would tell the planner how
common (more or less) a given word is?

Another attempt at it would be: return lower estimates for tsqueries
consisting of more parts - 'X'::tsquery is usually far less selective
than 'X & Y & Z & V'::tsquery.

I searched through the list archives, but couldn't find any other
attempts at this problem - were there any?

Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Google Summer of Code 2008