Discussion: text search changes vs. binary upgrade

text search changes vs. binary upgrade

From
Noah Misch
Date:
Commit bb14050 said:
    - change order for tsquery, so, users, who has a btree index over tsquery,
      should reindex it

The last change of this sort also modified pg_upgrade to issue REINDEX
guidance.  See old_8_3_invalidate_hash_gin_indexes() in the PostgreSQL 9.4
source.  PostgreSQL 9.6 pg_upgrade should do likewise.


Commit 61d66c4 may or may not warrant pg_upgrade treatment:
    Fix support of digits in email/hostnames.

    When tsearch was implemented I did several mistakes in hostname/email
    definition rules:
    1) allow underscore in hostname what prohibited by RFC
    2) forget to allow leading digits separated by hyphen (like 123-x.com)
       in hostname
    3) do no allow underscore/hyphen after leading digits in localpart of email
    ...
 

Any index (not just btree) that depends on a text search configuration using
parser pg_catalog.default may need a REINDEX after this change.  (Furthermore,
any constraint having such a dependency would need a recheck.  That use case
may be less important.)  I think the last changes to pg_catalog.default
semantics were 2c265ad (URLs) and 89b0095 (emails), both in 9.0.  For those,
we didn't change pg_upgrade or recommend REINDEX in the release notes.  We
could call that a relevant precedent and, for this 9.6 change, once again take
no particular action.  On the other hand, binary upgrade has matured since its
9.0 birth.  Perhaps standards have risen, and pg_upgrade should issue guidance
to REINDEX affected text search indexes.  (The guidance could mention the kind
of queries that will notice the difference.)  I lean toward having pg_upgrade
address this incompatibility.  Other opinions?



Re: text search changes vs. binary upgrade

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> Commit bb14050 said:
>     - change order for tsquery, so, users, who has a btree index over tsquery,
>       should reindex it

We undid that in 1ec4c7c05, no?  (Even if we didn't, the usefulness
of a btree index on tsquery seems negligibly small.)

> Commit 61d66c4 may or may not warrant pg_upgrade treatment:
>     Fix support of digits in email/hostnames.

The general theory about changes in text search parser and dictionary
behavior has always been that a reindex is not required, because that does
not invalidate the derived data in the same sort of way that changing,
say, btree sort order of a datatype would.  At worst, searches for the
specifically affected words might fail to find relevant entries because
to_tsvector now produces a different list of lexemes than before (and
those new lexemes are not in the index, the old ones are).  If the
affected set of words is sufficiently large and relevant to her use-case,
a user might judge that rebuilding derived tsvector data is worth her
trouble.  But I am dubious that pg_upgrade should issue guidance
unconditionally telling people to do it.  Most people probably aren't
going to have any noticeable amount of data that's affected by this change.

If we did worry about this for 61d66c4, then for example the unaccent
changes would also be problematic, and probably the ispell changes too.
I'm inclined to just group all those things in the release notes and
provide text counseling users to think about how much those changes affect
their full-text data and whether rebuilding derived tsvectors would be
worth it.
        regards, tom lane
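
The failure mode described above (new lexemes are not in the index, the old ones are) can be illustrated with a toy model. This is only a sketch under stated assumptions: `old_tokenize` and `new_tokenize` are hypothetical stand-ins for the parser behavior before and after a rule change, the in-memory inverted index stands in for stored tsvector data, and none of it is PostgreSQL code. How the real parser tokenized `123-x.com` before the fix is not stated in this thread and is assumed here.

```python
import re
from collections import defaultdict


def old_tokenize(text):
    """Hypothetical 'old' parser rule: a word such as '123-x.com' is not
    accepted as one hostname, so it is split into '123' and 'x.com'."""
    tokens = []
    for word in text.lower().split():
        match = re.fullmatch(r"(\d+)-(.+)", word)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(word)
    return tokens


def new_tokenize(text):
    """Hypothetical 'new' parser rule: leading digits followed by a hyphen
    are allowed, so '123-x.com' stays a single token."""
    return text.lower().split()


def build_index(docs, tokenize):
    """Build a toy inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query, tokenize):
    """Return ids of documents containing every token of the query."""
    result = None
    for token in tokenize(query):
        ids = index.get(token, set())
        result = set(ids) if result is None else result & ids
    return result if result else set()


docs = {
    1: "contact admin at 123-x.com",
    2: "visit example.com today",
}

# Index built before the upgrade, queried with the new rules afterwards.
stale_index = build_index(docs, old_tokenize)
# Index rebuilt after the upgrade.
fresh_index = build_index(docs, new_tokenize)

print(search(stale_index, "123-x.com", new_tokenize))    # set()
print(search(stale_index, "example.com", new_tokenize))  # {2}
print(search(fresh_index, "123-x.com", new_tokenize))    # {1}
```

In this model only the query for the affected word misses its document. Searches for unaffected words keep working against the stale index, which is why the cost of not rebuilding depends on how much of the data the change touches.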



Re: text search changes vs. binary upgrade

From
Noah Misch
Date:
On Tue, May 03, 2016 at 11:13:54PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Commit bb14050 said:
> >     - change order for tsquery, so, users, who has a btree index over tsquery,
> >       should reindex it
> 
> We undid that in 1ec4c7c05, no?

Ah, looks that way.

> > Commit 61d66c4 may or may not warrant pg_upgrade treatment:
> >     Fix support of digits in email/hostnames.
> 
> The general theory about changes in text search parser and dictionary
> behavior has always been that a reindex is not required, because that does
> not invalidate the derived data in the same sort of way that changing,
> say, btree sort order of a datatype would.  At worst, searches for the
> specifically affected words might fail to find relevant entries because
> to_tsvector now produces a different list of lexemes than before (and
> those new lexemes are not in the index, the old ones are).  If the
> affected set of words is sufficiently large and relevant to her use-case,
> a user might judge that rebuilding derived tsvector data is worth her
> trouble.  But I am dubious that pg_upgrade should issue guidance
> unconditionally telling people to do it.  Most people probably aren't
> going to have any noticeable amount of data that's affected by this change.
> 
> If we did worry about this for 61d66c4, then for example the unaccent
> changes would also be problematic, and probably the ispell changes too.
> I'm inclined to just group all those things in the release notes and
> provide text counseling users to think about how much those changes affect
> their full-text data and whether rebuilding derived tsvectors would be
> worth it.

Fair.