Обсуждение: counting text matches - any recipes?

Поиск
Список
Период
Сортировка

counting text matches - any recipes?

От
"Ross J. Reedstrom"
Дата:
Hey all - 
I'm working on the query interface to a system that stores a bunch of
(realtively) small text documents, with titles and keywords, etc.

For the basic interface, we have the user types some words into a textbox,
then just search 'everything'. I'm weighting the results, so that hits
on the title count more than on keywords, which count more than on the
abstract, which count more than in the body.

This is working reasonably well, but I'm stuck on one counting problem:
what's the natural way to count multiple matches in SQL?

As an example, let's use the title a.k.a. 'name' of an article (called
modules, in the schema).

I've already generating a big select with a stanza for each class of match,
unioned together, weighted and summed. Here's a typical stanza for the name:
   select moduleid, name, version, created, revised, abstract,           count(*)*100 as weight    from current_modules
cm,abstracts a    where     cm.abstractid = a.abstractid     and (              name ~* 'Fourier'          or  name ~*
'series'     )    group by  moduleid, name, version, created, revised, abstract 
 

Obviously, this will give one hit on a module with the name 'Fourier Series',
as well as one for 'Fourier Transforms', and one for 'Time Series Analysis'.

It's probably blindingly obvious, but how would I structure this to get
_two_ hits for 'Fourier Series', that'll still scale to, say, a dozen
search terms entered? I've thought of the subselect route, as so:
   select moduleid, name, version, created, revised, abstract,           count(*)*100 as weight    from (       select
moduleid,abstractid, name, version, created, revised             from current_modules cm            where name ~*
'Fourier'      union all        select moduleid, abstractid, name, version, created, revised             from
current_modulescm            where name ~* 'series'    ) as bar, abstracts a    where     bar.abstractid = a.abstractid
  group by  moduleid, name, version, created, revised, abstract 
 

But I'm not sure how well that'll scale, since this is already a subselect,
so I'd be nesting two deep.

Ross


Re: counting text matches - any recipes?

От
Josh Berkus
Дата:
Ross,

> For the basic interface, we have the user types some words into a textbox,
> then just search 'everything'. I'm weighting the results, so that hits
> on the title count more than on keywords, which count more than on the
> abstract, which count more than in the body.

Before you re-invent the wheel, have you checked out OpenFTS?  www.openfts.org
(I think).
>
> This is working reasonably well, but I'm stuck on one counting problem:
> what's the natural way to count multiple matches in SQL?

Within a single text field?  There isn't.

2 choices:

1) Use a Full Text Searching engine, such as the simple one in /contrib or a
more full-featured one like OpenFTS.   Both will allow you to do counts on
"hits" for a keyword.

2) Write a function in PL/perl which will count the number of keyword matches
in a text string. Potentially slower than #1, but easier to implement.

--
-Josh BerkusAglio Database SolutionsSan Francisco