Thread: Tsearch docs question

Tsearch docs question

From: Jeff Davis
Date:

The Tsearch docs, under the GiST and GIN section, say:

"Lossiness [of GiST] causes serious performance degradation since random
access of heap records is slow and limits the usefulness of GiST
indexes."

The docs do go into some detail, but I think it causes some confusion,
also.

Let me digress to state how I understand the relationship between GIN,
GiST, and RECHECK:

The benefit of avoiding RECHECK is to avoid the need to re-evaluate the
predicate after finding the entry in the index. This can be valuable in
tsearch, because the functions are much more expensive than (for
example) integer equality. We (currently) have to visit the heap anyway,
to see the visibility information. So avoiding a RECHECK clause doesn't
do anything to prevent random heap I/O (although, a less-lossy index
will have fewer false positives, by definition).
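
A toy model of the point above, as a minimal sketch rather than anything PostgreSQL actually does internally. The index here is a hypothetical one keyed only by the first letter of each word, which makes it lossy on purpose; `build_lossy_index` and `search` are illustrative names, not PostgreSQL functions. Every candidate row is fetched from the heap whether or not we recheck; the recheck only adds the predicate evaluations that weed out false matches.

```python
from collections import defaultdict


def build_lossy_index(heap):
    """Map each first letter to the ids of rows holding a word that starts with it.

    Keeping only the first letter throws information away, so a lookup
    can return rows that do not actually contain the search word.
    """
    index = defaultdict(set)
    for row_id, words in heap.items():
        for word in words:
            index[word[0]].add(row_id)
    return index


def search(heap, index, word, recheck):
    """Return (result row ids, heap visits, predicate evaluations)."""
    candidates = sorted(index.get(word[0], ()))
    heap_visits = 0
    predicate_evals = 0
    results = []
    for row_id in candidates:
        row = heap[row_id]      # fetched either way (stands in for the visibility check)
        heap_visits += 1
        if recheck:
            predicate_evals += 1    # the expensive part in the tsearch case
            if word not in row:
                continue            # false match removed by the recheck
        results.append(row_id)
    return results, heap_visits, predicate_evals


heap = {1: {"cat", "dog"}, 2: {"car", "bus"}, 3: {"cow", "cat"}, 4: {"emu"}}
index = build_lossy_index(heap)
print(search(heap, index, "cat", recheck=True))   # ([1, 3], 3, 3)
print(search(heap, index, "cat", recheck=False))  # ([1, 2, 3], 3, 0)
```

In both calls the heap is visited three times; only the number of predicate evaluations, and the correctness of the result, differs.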

GIN (as used with tsearch) is lossy for more sophisticated tsqueries
(those involving labels) and non-lossy for simpler tsqueries. There's
only one tsquery type, so PostgreSQL has no way of differentiating
between these two cases.

GiST (as used with tsearch) is lossy for large tsvectors or tsqueries
containing labels; and non-lossy for small tsvectors matched against a
tsquery that contains no labels. PostgreSQL can't differentiate between
these two cases.
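
To illustrate why document size matters for a lossy summary, here is a rough sketch assuming a fixed-width bit signature. The 64-bit width, the CRC32 hash, and the names `signature` and `false_match_rate` are assumptions made for this example and are not taken from the GiST operator class. As more words are folded into the signature, more bits are set, and words that are absent from the document increasingly appear to match.

```python
import zlib

SIG_BITS = 64  # assumed signature width for this toy model


def signature(words):
    """Fold words into a fixed-width bitmask, setting one bit per word."""
    sig = 0
    for word in words:
        sig |= 1 << (zlib.crc32(word.encode("utf-8")) % SIG_BITS)
    return sig


def false_match_rate(doc_size, probes=1000):
    """Fraction of absent words whose signature bit is set anyway."""
    doc_sig = signature(f"w{i}" for i in range(doc_size))
    hits = 0
    for i in range(probes):
        probe = signature([f"q{i}"])   # "q..." words never occur in the document
        if doc_sig & probe == probe:
            hits += 1
    return hits / probes


for size in (5, 50, 500):
    print(size, false_match_rate(size))
```

Because each larger document here contains all the words of the smaller ones, the false-match rate can only stay the same or rise as the document grows.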

So, for GiST they always RECHECK (so you're always sure to get the right
result), and for GIN the default operator does not RECHECK (for
performance), but if you suspect that you might be using labels in your
tsqueries you need to use a special RECHECKing operator, "@@@", to be
accurate.

Is the above accurate?

Back to the docs: I think the docs could clear this issue up somewhat.
The current wording suggests that GIN performs better because it avoids
a trip to the heap, when in reality it seems the benefit is avoiding the
need to re-evaluate the expensive tsearch functions (which might need to
access TOASTed data).

There's also a related issue: I think a RECHECK would be less costly if
you have the tsvectors materialized in the table (using triggers) and
index that. Maybe that could be a tip for using GiST indexes.

Regards,
    Jeff Davis


Re: Tsearch docs question

From: Tom Lane
Date:

Jeff Davis <pgsql@j-davis.com> writes:
> The Tsearch docs, under the GiST and GIN section, say:
> "Lossiness [of GiST] causes serious performance degradation since random
> access of heap records is slow and limits the usefulness of GiST
> indexes."

> The docs do go into some detail, but I think it causes some confusion,
> also.

Are you looking at CVS HEAD, or what was there in beta1?  I rewrote
that stuff a few days ago:
http://developer.postgresql.org/pgdocs/postgres/textsearch-indexes.html

> There's also a related issue: I think a RECHECK would be less costly if
> you have the tsvectors materialized in the table (using triggers) and
> index that. Maybe that could be a tip for using GiST indexes.

Yeah, I mentioned that somewhere in the chapter, I think.

            regards, tom lane

Re: Tsearch docs question

From: Jeff Davis
Date:

On Fri, 2007-10-26 at 15:26 -0400, Tom Lane wrote:
> Are you looking at CVS HEAD, or what was there in beta1?  I rewrote
> that stuff a few days ago:
> http://developer.postgresql.org/pgdocs/postgres/textsearch-indexes.html
>

Excellent, thanks, that's a big improvement to those docs all around. I
should have checked the latest before posting; almost everything I
mentioned was already addressed.

There's still one very minor thing:

"A GiST index is lossy, meaning it is necessary to check the actual
table row to eliminate false matches."

could be changed to something like:

"A GiST index is lossy, meaning that the index may produce false
matches, and it is necessary to check the actual table row to
eliminate these false matches."

And perhaps change:

"Lossiness causes performance degradation since random access to table
records is slow; ..."

to something like:

"Lossiness causes performance degradation due to unnecessary random
accesses to table records; ..."

The only reason I say this is because, on my first reading, I read that
to mean that lossless indexes don't require trips to the heap at all
(which isn't true, yet).

Regards,
    Jeff Davis


Re: Tsearch docs question

From: Tom Lane
Date:

Jeff Davis <pgsql@j-davis.com> writes:
> There's still one very minor thing:

Updated, thanks for the suggestions.

            regards, tom lane