Обсуждение: Database-based alternatives to tsearch2?

Поиск
Список
Период
Сортировка

Database-based alternatives to tsearch2?

От
Wes
Дата:
I've seen questions asked on the list about alternatives to tsearch2, but
not for the type of full text indexing I'm looking for.

I'm looking for a non index-based full text indexing - one that stores the
information as table data instead of index data.  I do not need to implement
SQL operators for searches.  The application library would need to implement
the actual word search.

Indexes are too fragile.  Our documents will be offline, and re-indexing
would be impossible.  Additionally, as I undertstand it, tsearch2 doesn't
scale to the numbers I need (hundreds of millions of documents).

Is anyone aware of any such solutions for PostgreSQL, open source or
otherwise?

Thanks

Wes



Re: Database-based alternatives to tsearch2?

От
Jeff Davis
Дата:
On Tue, 2006-12-12 at 12:19 -0600, Wes wrote:
> I'm looking for a non index-based full text indexing - one that stores the
> information as table data instead of index data.  I do not need to implement
> SQL operators for searches.  The application library would need to implement
> the actual word search.
>

Store the tsvector (a custom type provided by tsearch2) as a separate
column in the table. This data type holds all the important information
about the indexed text, such as distinct words and some position
information, but it takes up much less space than a large document.

The tsearch2 package provides a lot of functionality even without the
index. But after you have a tsvector column, you can create an index on
it if you want.

> Indexes are too fragile.  Our documents will be offline, and re-indexing
> would be impossible.  Additionally, as I undertstand it, tsearch2 doesn't
> scale to the numbers I need (hundreds of millions of documents).
>

Try PostgreSQL 8.2 with tsearch2 using GIN. As I understand it, that's
very scalable.

Also, as I understand it, a GIN index should not need to be reindexed
unless there is a huge shift in the set of distinct words you're using.
However, if you do need to reindex, you can if you have the tsvector
column.

Regards,
    Jeff Davis




Re: Database-based alternatives to tsearch2?

От
Richard Huxton
Дата:
Wes wrote:
>
> Indexes are too fragile.  Our documents will be offline, and re-indexing
> would be impossible.  Additionally, as I undertstand it, tsearch2 doesn't
> scale to the numbers I need (hundreds of millions of documents).

Jeff's right about tsvector - sounds like it's what you're looking for.

If you're worried about reindexing costs, perhaps look at partioning the
table, or using partial indexes (so you could have multiple indexes for
each table, based on (id mod 100) or some such).

Obviously, partitioning over multiple machines is usually quite do-able
for this sort of task too.

> Is anyone aware of any such solutions for PostgreSQL, open source or
> otherwise?

Without wishing to discourage a potential large user from PG, it might
be worth checking if Google/Yahoo/etc have a non-relational server that
meets your needs off-the-shelf.

--
   Richard Huxton
   Archonet Ltd

Re: Database-based alternatives to tsearch2?

От
"Daniel Verite"
Дата:
    Wes wrote:

> I've seen questions asked on the list about alternatives to tsearch2, but
> not for the type of full text indexing I'm looking for.
>
> I'm looking for a non index-based full text indexing - one that stores the
> information as table data instead of index data.  I do not need to implement
> SQL operators for searches.  The application library would need to implement
> the actual word search.

Not sure if it would fit your needs, but DBIx-TextIndex implements
FTI using table data:
http://search.cpan.org/~dkoch/DBIx-TextIndex-0.25/lib/DBIx/TextIndex.pm

--
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org