Re: Google Summer of Code 2008

Поиск
Список
Период
Сортировка
От Jan Urbański
Тема Re: Google Summer of Code 2008
Дата
Msg-id 47CD9758.4070403@students.mimuw.edu.pl
обсуждение исходный текст
Ответ на Re: Google Summer of Code 2008  (Oleg Bartunov <oleg@sai.msu.su>)
Список pgsql-hackers
Oleg Bartunov wrote:
> Jan,
>
> the problem is known and well requested. From your promotion it's not
> clear what's an idea ?

I guess the first approach could be to populate some more columns in
pg_statistics for tables with tsvectors. I see there are some statistics
already being gathered (pg_stat's histogram_bounds are populated for
tsvector columns), so maybe one could use that?
Even remembering a few of the most frequently appearing lexemes could in
my opinion help. I plotted distinct lexemes against the number documents
containing them (basically the output of stat()) in one of our databases
and came out with this:
http://www.fiok.pl/~jurbanski/kaired-depesze.png
The three high values are really stopwords, and partially because of
that I wrote my first FTS patch, but this shows that if we'd remember
the ~40 most frequent lexemes, we could give much better estimates for
popular queries (and I think are the ones that hurt performance most are
those which underestimate the row count).

As for a more general solution I'd have to read deeper into the tsearch
code to understand how the tsvector type and @@ operator work and give
it a bit more thought. I'm planning to do that in the next three weeks
(read: before the student applications period starts). Maybe some kind
of heuristic could be implemented? Possibly someone could load some
information specific to her language, which would tell the planner how
common (more or less) a given word is?

Another attempt at it would be: return lower estimates for tsqueries
consisting of more parts - 'X'::tsquery is usually far less selective
than 'X & Y & Z & V'::tsquery.

I searched through the list archives, but couldn't find any other
attempts at this problem - were there any?

Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: "could not open relation 1663/16384/16584: No such file or directory" in a specific combination of transactions with temp tables
Следующее
От: Zdenek Kotala
Дата:
Сообщение: Re: How to handle error message in PG_CATCH