Re: [SQL] Dilema.

From: Gene Selkov Jr.
Subject: Re: [SQL] Dilema.
Date:
Msg-id: 199912110650.AAA13687@mail.xnet.com
In reply to: Dilema.  ("Mitch Vincent" <mitch@venux.net>)
List: pgsql-sql
"Mitch Vincent" wrote:
>   I have a problem that I'm hoping someone can help me with. I've built a
> web-based application that is back-ended by PGsql. It works great, I
> couldn't ask for better performance.
> 
> However, I just recently coded another module for it, and this module needed
> to accept text files (or just large text blocks, let's say). I ran into the 8k
> query limit barrier. I need these blocks of text to be searchable and am
> having to do this with a flat-file based solution right now; it's slow and
> cumbersome.

I realize my stuff does not qualify as anything for Tom Lane to write
home about. However, there's a chance that the method I am using fits
the purpose you describe (especially considering the quantities you
mentioned).

The following query is fairly complex, but it takes just a few seconds
to complete on our grossly overloaded server:


http://wit.mcs.anl.gov/EMP/select_emp_advanced.cgi?E1.ec_code=ec&E1.ec_code.op=%7E%09is+in+range&E1.ec_code.patt=3.4&ec_code.count=1&T1.text=or&T1.text.op=%7E*%09matches+regex.%2C+case-insensitive&T1.text.patt=rattus&T2.text=ocn&T2.text.op=%7E*%09matches+regex.%2C+case-insensitive&T2.text.patt=rat&T3.text=en&T3.text.op=%7E*%09matches+regex.%2C+case-insensitive&T3.text.patt=protease&text.count=1&N1.seg=km&N1.seg.op=%7E%09contained+in&N1.seg.patt=0+..+10&seg.count=1&constraint=E1+%26+%28T1+%7C+T2%29+%26+T3+%26+N1&do=Run+the+query

It would have taken at least as long to query any individual term with
Glimpse, and it would not be feasible to query things like E1 and N1
with anything but postgres, because these queries are based on special
data types. Also, I can imagine some difficulty in doing all but
trivial relational operations with your average full-text engine.

The idea is simple:

1. Do not store the original files in postgres, or, if you are
desperate, store them as large objects. The only win with large
objects is that they can be passed to the server and back through a
protocol supported by all postgres clients, but you lose that to the
headache of maintaining the large objects, which do need some
care. Note that there is no win at all if your application is
already web-based by design.

A feasible alternative to both large objects and external file storage
is to use compression plus uu or base64 encoding, which lets you
split your files into nice-looking numbered 61- or 76-character
lines. Postgres can handle an awful lot of those.

Check out the MIME::Base64 and Compress::Bzip2 modules if you are
building your clients in perl.
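As a sketch of that encode-then-split step (shown here with Python's standard bz2 and base64 modules rather than the Perl modules named above; the 76-character line length is just base64's conventional wrapping):

```python
import base64
import bz2

def encode_for_storage(text: str) -> list[str]:
    """Compress a text block, then base64-encode it into
    76-character lines suitable for storing one row per line."""
    compressed = bz2.compress(text.encode("utf-8"))
    encoded = base64.encodebytes(compressed).decode("ascii")
    # encodebytes wraps its output at 76 characters per line
    return [line for line in encoded.splitlines() if line]

def decode_from_storage(lines: list[str]) -> str:
    """Reverse the process: join the lines, decode, decompress."""
    compressed = base64.b64decode("".join(lines))
    return bz2.decompress(compressed).decode("utf-8")
```

Each line can then be stored as a numbered row, comfortably under any per-tuple size limit, and the original file reassembled by sorting on the line number.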

2. Having decided how to store and transfer the files, split them up
into words, load those into postgres, and build indices on them. Word
indices can be quite efficient because you can use ^-anchored
expressions with them. Also, parts of the data in your text can be
numeric, and it would be natural to index those as
numbers. Furthermore, if there are portions of your text that
represent distinct classes of data, it would be equally natural to
store those in separate index tables. There are about 300 classes of
data in the database I am talking about, which is derived from
published research articles. The following page roughly outlines how I
did it:

http://wit.mcs.anl.gov/EMP/indexing.html
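A minimal sketch of the word-splitting step (the tokenization rule and the row layout here are made up for illustration; the real schema is outlined at the URL above):

```python
import re

def index_words(doc_id: int, text: str) -> list[tuple[int, int, str]]:
    """Split a text block into words and emit (doc_id, position, word)
    tuples -- one row per word for a hypothetical word-index table."""
    words = re.findall(r"\w+", text.lower())
    return [(doc_id, pos, word) for pos, word in enumerate(words)]
```

Each tuple becomes one INSERT into the word table; a btree index on the word column then serves ^-anchored (prefix) pattern searches efficiently.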

This is not yet a full-text index, but it would be trivial to make one
based on such a paradigm. It won't allow you to search for arbitrary
substrings, but if you record the sequence in addition to the
absolute location of words, you will be able to search for "phrases"
of the AltaVista sort, or even use proximity criteria like they do.
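If word positions are recorded as suggested, phrase search reduces to position arithmetic. A hypothetical in-memory illustration (a real query would do the same with self-joins on the word table):

```python
def find_phrase(index: dict[str, list[int]], words: list[str]) -> list[int]:
    """index maps word -> positions within one document.  Return the
    starting positions where the given words occur consecutively."""
    hits = []
    for start in index.get(words[0], []):
        # a phrase match means word i sits exactly i positions after the start
        if all((start + i) in index.get(w, []) for i, w in enumerate(words)):
            hits.append(start)
    return hits
```

Proximity search is the same check with a window ("within n words") instead of an exact offset.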

--Gene

