Discussion: fulltext searching via a custom index type

fulltext searching via a custom index type

From: Eric Ridge
Date:
(I started with this addressed to the -general list, but it just 
doesn't seem like a general usage topic.  If -hackers is the wrong 
place, please point me in the right direction).

I've been working on a custom index type (access method) that allows 
postgres to do fulltext searching (including phrase and proximity) that 
I believe is more flexible and natural than "tsearch2" (no offense to 
the tsearch2 guys... it's been a source of inspiration).  I also plan 
on providing "term browsing" and hit-position information.

I'm using Xapian (www.xapian.org) as the backend fulltext engine, and 
I'm trying to learn more of the innards of postgres as I go.  I've only 
spent about 18-24 hours on this, so it's nowhere near complete, and I 
really hesitate to even mention it in a public forum.  But if in the 
end it doesn't suck it might make a great contrib/ package.  *shrug*  
who knows?

In a nutshell, I've implemented all the necessary access method 
functions to teach postgres about a new index type named "xapian", and 
so far, everything is really great.  It works just like any other index 
type would:
    create table test (stuff varchar(255), more_stuff text);
    create index idxstuff on test using xapian (stuff);
    create index idxmore_stuff on test using xapian (more_stuff);

    insert into test (stuff, more_stuff) values ('this is stuff', 'this is more stuff');
    insert into test (stuff, more_stuff) values ('my name is eric ridge', 'i like to drink beer');

    select * from test where stuff => 'stuff' or more_stuff => '"drink beer"';

             stuff         |      more_stuff
    -----------------------+----------------------
     this is stuff         | this is more stuff
     my name is eric ridge | i like to drink beer
    (2 rows)

All this aside, I've got some questions related to indexes and query 
plans.

1) Is it possible for an access method to receive some kind of "DROP 
INDEX" notification?  As far as I can tell, the answer is "no", but it 
can't hurt to ask.

2) When does the query planner decide that it needs to "Filter" 
results, and is there any way to turn this off or otherwise fool the 
query planner into NOT doing Filters (even if only for certain 
operators)?

For example, this query:

    select * from test where stuff => 'stuff' AND NOT more_stuff => '"drink beer"';

has this plan:

    Index Scan using idxstuff on test  (cost=0.00..-0.98 rows=250 width=177)
      Index Cond: ((stuff)::text => 'stuff'::text)
      Filter: (NOT (more_stuff => '"drink beer"'::text))

In this case, postgres is forced to re-parse the contents of the 
"more_stuff" field (via the defined procedure for the => operator) for 
every row returned by the previous index scan, just so it can determine 
if the field contains the phrase 'drink beer' or not.  Since so much 
overhead is involved here, it would be cool if postgres could somehow 
do another index scan.

Maybe there's some way for the operator function to know not only the 
Datum value, but the actual field (and ItemPointer)?

3) How does one get the $PGDATA directory? getenv() doesn't seem ideal 
since PGDATA can be specified as a command-line argument.

4) Can a function be registered as part of a transaction, pre-commit -- 
so the function can have an opportunity to abort the transaction.  I've 
seen RegisterEOXactCallback, but it's not quite what I'm looking for.

5) are there discussions in the archives of this list (or other pgsql- 
lists) that discuss fulltext searching that y'all think are worth 
reading?

thanks in advance for your time and input!

eric



Re: fulltext searching via a custom index type

From: Tom Lane
Date:
Eric Ridge <ebr@tcdi.com> writes:
> 1) Is it possible for an access method to receive some kind of "DROP 
> INDEX" notification?

No.

>     select * from test where stuff => 'stuff' AND NOT more_stuff => 
> '"drink beer"';
> has this plan:

To do better here, you'd need to invent a "not-=>" operator, so that the
above could be simplified to, say,

    select * from test where stuff => 'stuff' AND more_stuff ~=> '"drink beer"';

and then define both => and ~=> as members of your index opclass, and
then build a two-column index on (stuff, more_stuff) ... whereupon
the planner would pass your index AM both clauses and it would be up
to you to do something intelligent.  You might want to back off a little
on how much of this you really want to do in a first cut.
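
For concreteness, a rough sketch of what that could look like at the SQL
level.  The ~=> support function name and the strategy numbers are
hypothetical, and a real opclass for the "xapian" AM would also need whatever
support FUNCTIONs that AM defines:

    create operator ~=> (
        leftarg = text,
        rightarg = text,
        procedure = xapian_not_contains   -- hypothetical "does not contain" function
    );

    create operator class xapian_text_ops
        default for type text using xapian as
            operator 1 =>,
            operator 2 ~=>;

    create index idxboth on test using xapian (stuff, more_stuff);

With both operators in the opclass and the two-column index in place, the
planner can hand both quals to the index AM as index conditions instead of
applying one of them as a Filter.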

> 3) How does one get the $PGDATA directory?

DataDir.  Why should you care?  An index AM that wants to know this is
probably broken IMHO; it's certainly trying to do something that's
outside the charter of index AMs, and is likely to cause lots of
headaches.

> 4) Can a function be registered as part of a transaction, pre-commit -- 
> so the function can have an opportunity to abort the transaction.

Why would that be a good idea?  When exactly would you expect it to be
called (relative to the other ninety-nine things that happen at commit)?
How are you going to recover if something else aborts the transaction,
either before or after your function runs?

I get the impression from your questions that you are trying to make an
end run around the transaction mechanism.  This is almost certainly
doomed to failure.  You need to find a way of storing all your data
within the ordinary index structure.
        regards, tom lane


Re: fulltext searching via a custom index type

From: Eric Ridge
Date:
On Dec 26, 2003, at 3:22 PM, Tom Lane wrote:
>> 3) How does one get the $PGDATA directory?
>
> DataDir.  Why should you care?  An index AM that wants to know this is
> probably broken IMHO; it's certainly trying to do something that's
> outside the charter of index AMs, and is likely to cause lots of
> headaches.

Xapian has its own storage subsystem, and that's what I'm using to 
store the index... not using anything internal to postgres (although 
this could change).  As such, Xapian needs to know *where* to save its 
indexes... $PGDATA seemed like a good place to start.

>> 4) Can a function be registered as part of a transaction, pre-commit 
>> --
>> so the function can have an opportunity to abort the transaction.
>
> Why would that be a good idea?  When exactly would you expect it to be
> called (relative to the other ninety-nine things that happen at 
> commit)?
> How are you going to recover if something else aborts the transaction,
> either before or after your function runs?

I don't really have an answer to this.  :)

> I get the impression from your questions that you are trying to make an
> end run around the transaction mechanism.

Perhaps.  I'm actually fishing for ideas to bridge xapian's transaction 
facilities to postgres.  Your comment confirms my suspicions that it 
just ain't gunna work out.

> This is almost certainly doomed to failure.  You need to find a way of 
> storing all your data
> within the ordinary index structure.

You are probably right.  :)  I'm just playing around right now.  I do 
appreciate your response, it's given me a lot to think about.

eric



Re: fulltext searching via a custom index type

From: Tom Lane
Date:
Eric Ridge <ebr@tcdi.com> writes:
> Xapian has its own storage subsystem, and that's what I'm using to 
> store the index... not using anything internal to postgres (although 
> this could change).

I would say you have absolutely zero chance of making it work that way.
You will not be able to get it to interoperate reliably with
transactions, checkpointing, or WAL replay; to say nothing of features
we might add in the future, such as tablespaces and point-in-time recovery.
You need to migrate all the data into the Postgres storage mechanism.

It might be worth pointing out here that an index AM is not bound to use
exactly the typical Postgres page layout.  I think you probably do have
to use the standard page header, but the page contents don't have to
look like tuples if you don't want 'em to.  For precedent see the hash
index AM, which stores ordinary index tuples on some index pages but
uses other pages for its own purposes.
        regards, tom lane


Re: fulltext searching via a custom index type

From: Eric Ridge
Date:
On Dec 26, 2003, at 4:04 PM, Tom Lane wrote:

> Eric Ridge <ebr@tcdi.com> writes:
>> Xapian has its own storage subsystem, and that's what I'm using to
>> store the index... not using anything internal to postgres (although
>> this could change).
>
> I would say you have absolutely zero chance of making it work that way.

thanks for the encouragement!  :)

> You will not be able to get it to interoperate reliably with
> transactions, checkpointing, or WAL replay; to say nothing of features
> we might add in the future, such as tablespaces and point-in-time 
> recovery.
> You need to migrate all the data into the Postgres storage mechanism.

And these are the things I'm struggling with now.  The basic indexing 
and searching currently works flawlessly, but the moment another user 
connects up, everything goes to hell.

> It might be worth pointing out here that an index AM is not bound to 
> use
> exactly the typical Postgres page layout.  I think you probably do have
> to use the standard page header, but the page contents don't have to
> look like tuples if you don't want 'em to.  For precedent see the hash
> index AM, which stores ordinary index tuples on some index pages but
> uses other pages for its own purposes.

That's useful information.  Thanks.  I've been using the hash AM as my 
guide b/c it's not as complex as the other index types (at least on the 
public interface side).  Obviously, I'm trying to save the time and 
energy of re-inventing the wheel when it comes to full text indexing and 
searching.  Xapian is an awesome standalone engine (and it's amazingly 
fast too!), so it seemed like a good place to start.  Its backend 
storage subsystem is pluggable, and after our little exchange here 
today, I'm now considering writing a postgres backend for Xapian.

I assume the doc chapter on Page Files and the various storage-related 
README files are good places for more information.  Any other tips or 
pointers?

eric



Re: fulltext searching via a custom index type

From: Tom Lane
Date:
Eric Ridge <ebr@tcdi.com> writes:
> I assume the doc chapter on Page Files and the various storage-related 
> README files are good places for more information.  Any other tips or 
> pointers?

Not right offhand, but feel free to ask questions when you get stuck.
Also don't forget that there's a wealth of info in source code comments;
you should always try looking for existing code that does something
similar to what you want to do.

BTW, one problem with using the hash AM as a model is that it's not yet
WAL-aware.  The btree AM is the only one that's really WAL-ified yet,
so you need to look at that if you want to think now about how to make
your code safe for WAL.  I think you'd probably be okay to postpone this
consideration for later, but keep in mind that all your operations that
change the contents of index files should be divisible into bite-size
operations that can be logged as WAL entries.  Each "bite-size
operation" has to leave the index in a legal state, too, so there's a
limit as to how small you can subdivide.
        regards, tom lane


Postgres + Xapian (was Re: fulltext searching via a custom index type )

From: Eric B. Ridge
Date:
On Dec 26, 2003, at 4:04 PM, Tom Lane wrote:
> Eric Ridge <ebr@tcdi.com> writes:
>> Xapian has its own storage subsystem, and that's what I'm using to
>> store the index... not using anything internal to postgres (although
>> this could change).
>
> I would say you have absolutely zero chance of making it work that way.

I still think this is one of the best quotes I've heard in awhile.  :)

> It might be worth pointing out here that an index AM is not bound to
> use
> exactly the typical Postgres page layout.

Thanks again for this little bit of info.  It was just enough to get me
thinking about how to make "it work".

Xapian is basically a big btree index, except it's 5 btree indexes.
One for terms, one for posts (terms with positions), one for positions,
one for arbitrary document values, and one for the documents
themselves.  Each index is made up of 3 physical files on disk.  All
told there's 17 files for a single Xapian index (15 db files, a
versioninfo file, and a lock file).

I couldn't think of a way to create a whole new database type for
Xapian that could deal with managing 5 btree indexes inside of Postgres
(other than using tables w/ standard postgres btree index on certain
fields), so instead, I dug into Xapian and abstracted out its 
filesystem i/o (open, read, write, etc).

(as an aside, I did spend some time pondering ways to adapt Postgres'
nbtree AM to handle this, but I just don't understand how it works)

Once I had Xapian's filesystem i/o encapsulated into a nice little C++
class, I embarked on creating a mini "filesystem" on top of Postgres'
storage subsystem.  In essence, I've now got a Postgres access method
that mirrors the basics of a filesystem, from creating/open files to
reading from and writing to them, in addition to truncation and
deletion.

After that, it was just a matter of the glue code to teach Xapian to
use this "filesystem" for all its filesystem i/o, and voila!, Xapian
works on top of Postgres' storage subsystem and I didn't have to rewrite
Xapian from scratch.  And surprisingly, despite the additional overhead
of this filesystem abstraction layer, it's still very fast... esp. once
Buffers get cached.

I've still got more work to do (like dealing with locking and general
concurrency issues, not to mention bugs I haven't found yet), but it's
working *really* well in a single-user environment.

So here's the important question:  How stupid is this?

I've done some benchmarking against tsearch2.  Attached are the queries
and execution times on my dual 2gig G5 w/ 2gig ram.

The table contains 51,160 records.  It's every text file contained on
my computer (which includes multiple copies of all my java projects).
All told, it's 337,343,569 bytes of data, with an average file size of
6,594 bytes.  The Xapian operator is "=>", and tsearch2's operator is
"@@".  I ran each query 6 times, and just took the best execution time.

It's also worth noting that my document parser is much different than
tsearch2's.  I'm splitting words on non-alphanumerics (and currently am
not using stopwords), and it seems that tsearch2 tries to do something
more intelligent, so the # of results returned varies widely between 
tsearch2 and Xapian.  I'm not offering an opinion on which way is
"better".

I've got a few more questions about transactions, locking, and a few
other things, but I just thought I'd throw this out as a status report
and to see if there's any kind of reaction.

thanks for your time.

eric


Attachments

Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )

From: Alvaro Herrera
Date:
On Thu, Jan 01, 2004 at 11:19:07PM -0500, Eric B.Ridge wrote:

> I couldn't think of a way to create a whole new database type for 
> Xapian that could deal with managing 5 btree indexes inside of Postgres 
> (other than using tables w/ standard postgres btree index on certain 
> fields), so instead, I dug into Xapian and abstracted out its 
> filesystem i/o (open, read, write, etc).
> 
> (as an aside, I did spend some time pondering ways to adapt Postgres' 
> nbtree AM to handle this, but I just don't understand how it works)

I think your approach is too ugly.  You will have tons of problems the
minute you start thinking about concurrency (unless you want to allow
only a single user accessing the index) and recovery (unless you want to
force users to REINDEX when the system crashes).

I think one way of attacking the problem would be using the existing
nbtree by allowing it to store the five btrees.  First read the README
in the nbtree dir, and then poke at the metapage's only structure.  You
will see that it has a BlockNumber to the root page of the index.  Try
modifying that to make it have a BlockNumber to every index's root page.
You will have to provide ways to access each root page and maybe other
nonstandard things (such as telling the root split operation what root
page are you going to split), but you will get recovery and concurrency
(at least to a point) for free.

Hope this helps,

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La espina, desde que nace, ya pincha" (Proverbio africano)


Re: Postgres + Xapian (was Re: fulltext searching via a

From: Christopher Kings-Lynne
Date:
> I think one way of attacking the problem would be using the existing
> nbtree by allowing it to store the five btrees.  First read the README
> in the nbtree dir, and then poke at the metapage's only structure.  You
> will see that it has a BlockNumber to the root page of the index.  Try
> modifying that to make it have a BlockNumber to every index's root page.
> You will have to provide ways to access each root page and maybe other
> nonstandard things (such as telling the root split operation what root
> page are you going to split), but you will get recovery and concurrency
> (at least to a point) for free.

Why not write it using the GiST interface - that is after all the entire 
point of GiST...

Chris


Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )

From: Alvaro Herrera
Date:
On Sat, Jan 03, 2004 at 02:49:23PM +0800, Christopher Kings-Lynne wrote:
> >I think one way of attacking the problem would be using the existing
> >nbtree by allowing it to store the five btrees.
> 
> Why not write it using the GiST interface - that is after all the entire 
> point of GiST...

Well, the project is to adapt Postgres as a Xapian backend ... I think
it is much more straightforward to use the BTrees that Xapian already
uses, than to use GiST.  Plus, GiST already has an interface for
fulltext indexing, doesn't it?

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La victoria es para quien se atreve a estar solo"


Re: Postgres + Xapian (was Re: fulltext searching via a

From: Oleg Bartunov
Date:
On Sat, 3 Jan 2004, Christopher Kings-Lynne wrote:

> > I think one way of attacking the problem would be using the existing
> > nbtree by allowing it to store the five btrees.  First read the README
> > in the nbtree dir, and then poke at the metapage's only structure.  You
> > will see that it has a BlockNumber to the root page of the index.  Try
> > modifying that to make it have a BlockNumber to every index's root page.
> > You will have to provide ways to access each root page and maybe other
> > nonstandard things (such as telling the root split operation what root
> > page are you going to split), but you will get recovery and concurrency
> > (at least to a point) for free.
>
> Why not write it using the GiST interface - that is after all the entire
> point of GiST...

I think this would be the best approach; we already have btree_gist, which
is a GiST implementation of btree.  We have started working on concurrency
and recovery for GiST and hope to get it working for 7.5.  Another thing we'd
like to work on is extending the GiST interface, so that other important ADTs
like digital trees, quad-trees, s-trees etc. could be implemented.  But we
decided to get concurrency and recovery working first.

>
> Chris
>
Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )

From: Eric Ridge
Date:
On Jan 2, 2004, at 4:54 PM, Alvaro Herrera wrote:
> I think your approach is too ugly.  You will have tons of problems the
> minute you start thinking about concurrency (unless you want to allow
> only a single user accessing the index)

It might be ugly, but it's very fast.  Surprisingly fast, actually.

Concerning concurrency, Xapian internally supports multiple readers and 
only 1 concurrent writer.  So the locking requirements should be far 
less complex than a true concurrent solution.  Now, I'm not arguing 
that this is ideal, but if Xapian is a search engine you're interested in, 
then you've already made up your mind that you're willing to deal with 
1 writer at a time.

However, Xapian does have built-in support for searching multiple 
databases at once.  One thought I've had is to simply create a new 
1-document database on every INSERT/UPDATE beyond the initial CREATE 
INDEX.  Then whenever you do an index scan, tell Xapian to use all the 
little databases that exist in the index.  This would give some bit of 
concurrency.  Then on VACUUM (or FULL), all these little databases 
could be merged back into the main index.

> and recovery (unless you want to force users to REINDEX when the 
> system crashes).

I don't yet understand how the WAL stuff works.  I haven't looked at 
the API's yet, but if something you can record is "write these bytes to 
this BlockNumber at this offset", or if you can say, "index Tuple X 
from Relation Y", then it seems like recovery is still possible.

If ya can't do any of that, then I need to go look at WAL further.

> I think one way of attacking the problem would be using the existing
> nbtree by allowing it to store the five btrees.  First read the README
> in the nbtree dir, and then poke at the metapage's only structure.  You
> will see that it has a BlockNumber to the root page of the index.

Right, I had gotten this far in my investigation already.  The daunting 
thing about trying to use the nbtree code is the code itself.  It's 
very complex.  Plus, I just don't know how well the rest of Xapian 
would respond to all of a sudden having a concurrent backend.  It's 
likely that it would make no difference, but it's just an unknown to me 
at this time.

> Try modifying that to make it have a BlockNumber to every index's root 
> page.
> You will have to provide ways to access each root page and maybe other
> nonstandard things (such as telling the root split operation what root
> page are you going to split), but you will get recovery and concurrency
> (at least to a point) for free.

And I'm not convinced that recovery and concurrency would be "for free" 
in this case either.  The need to keep essentially 5 different trees in 
sync greatly complicates the concurrency issue, I would think.

thanks for your time!

eric



Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )

From: Eric Ridge
Date:
Thanks to everyone that provided input about this idea.  After taking 
everything into consideration and talking with Olly Betts from the 
Xapian project, what I'm going to do is implement a btree that supports 
multiple roots and an "tag" value of arbitrary length for each item in 
the tree.  Then implement a new Xapian backend that uses it.  In the 
end, it'll be much more efficient and much less dirty than this silly 
filesystem thing I have now.  And hopefully, concurrency and WAL will 
be easier to deal with too.  Having the existing nbtree AM as guide 
will be very useful.

I have one issue that could potentially make all of this completely 
useless, and it's related to Postgres deciding to use a Filter instead 
of an Index Scan.

I asked this question last week, and Tom Lane responded with a 
solution, but I don't think I explained myself very well.  And now that 
I've got a bit more experience with some of the PG internals, maybe I 
can ask the question more intelligently.

Given this query:

    select id from table where document_text ==> 'food' and title ==> 'water';

Postgres generates this plan:

    Index Scan using idxtitle on table  (cost=0.00..4.01 rows=1 width=8)
      Index Cond: (title ==> 'water'::text)
      Filter: (document_text ==> 'food'::text)

The problem is, the "document_text" column is *really* big.  Average 
value length is 171k.

With this query plan, my operator procedure is forced to re-parse the 
document_text column from each row returned by the index scan against 
the title column, and do a bunch of Xapian tricks for each row.  The 
overhead is huge.

The composite index solution doesn't seem ideal here because there's 
absolutely no way I can anticipate every combination of fields a user 
might choose to search.
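
To make the combinatorial problem concrete: with composite indexes you would
need roughly one index per combination of fields a user might search.  With
hypothetical table and column names, that looks like this:

    create index idx_title_body  on docs using xapian (title, document_text);
    create index idx_title_notes on docs using xapian (title, notes);
    create index idx_body_notes  on docs using xapian (document_text, notes);
    -- ... and so on for every combination a user might actually query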

Forgetting about composite indexes for a moment, is postgres even 
capable of doing 2 index scans in this situation?  Does it know how to 
take the intersection of two scans?

My AM's cost estimate function literally sets the selectivity, 
correlation, total cost, and startup cost values to zero (and I've 
tried tons of combinations of really big, really small, and really 
negative values).  My thought behind setting them all to zero was that 
if the cost function basically says, "There is no cost", I could fool 
Postgres into wanting to use the index everywhere it can.  Sadly, this 
doesn't work out.

Now, I realize I can fix my cost estimate function to return better 
costs for the title and document_text fields so that PG will instead 
decide to index scan on document_text, but what I really want to do is 
make PG do an index scan for each field.

Can PG even do this?

I'd appreciate any thoughts on this.  Hopefully, I'm just missing 
something really obvious.

thanks!

eric



Re: Postgres + Xapian (was Re: fulltext searching via a

From: Hannu Krosing
Date:
Eric Ridge wrote on Fri, 09.01.2004 at 01:16:

> Forgetting about composite indexes for a moment, is postgres even 
> capable of doing 2 index scans in this situation?  Does it know how to 
> take the intersection of two scans?

Not currently, but it has been in TODO for quite some time:

* Use bitmaps to fetch heap pages in sequential order [performance]
* Use bitmaps to combine existing indexes [performance]

---------------
Hannu



Re: Postgres + Xapian (was Re: fulltext searching via a

From: Bruce Momjian
Date:
Hannu Krosing wrote:
> Eric Ridge wrote on Fri, 09.01.2004 at 01:16:
> 
> > Forgetting about composite indexes for a moment, is postgres even 
> > capable of doing 2 index scans in this situation?  Does it know how to 
> > take the intersection of two scans?
> 
> Not currently, but it has been in TODO for quite some time:
> 
> * Use bitmaps to fetch heap pages in sequential order [performance]
> * Use bitmaps to combine existing indexes [performance]

And this is the foundation of bitmapped indexes, as I understand it, and is
used for "star" joins, popular with data warehousing.

Rather than use those terms, I spelled out the actual behavior we want.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Postgres + Xapian (was Re: fulltext searching via a

From: Greg Stark
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Hannu Krosing wrote:
> > Eric Ridge wrote on Fri, 09.01.2004 at 01:16:
> > 
> > > Forgetting about composite indexes for a moment, is postgres even 
> > > capable of doing 2 index scans in this situation?  Does it know how to 
> > > take the intersection of two scans?
> > 
> > Not currently, but it has been in TODO for quite some time:
> > 
> > * Use bitmaps to fetch heap pages in sequential order [performance]
> > * Use bitmaps to combine existing indexes [performance]
> 
> And this is the foundation of bitmapped indexes, as I understand it, and is
> used for "star" joins, popular with data warehousing.

Actually I think the thinking with those two TODO items is to use these
algorithms in conjunction with regular indexes: do regular index scans in
parallel, construct bitmaps, perform logical operations to join them, and then
do the table scan.
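
A sketch of what that buys, reusing the operators from Eric's example (the
table name and second index name are hypothetical, and the BitmapAnd plan
shape is what those TODO items aim for, not something the planner produces
today):

    --  BitmapAnd
    --    ->  Bitmap Index Scan on idxtitle          (title ==> 'water')
    --    ->  Bitmap Index Scan on idxdocument_text  (document_text ==> 'food')

In the meantime the intersection can be spelled out by hand, letting each side
of the set operation use its own index scan:

    select id from docs where title ==> 'water'
    intersect
    select id from docs where document_text ==> 'food';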

It could be done in conjunction with a scheme for storing bitmap indexes or
not. Probably not though.

-- 
greg