Re: [SQL] Dilema.
From | Gene Selkov Jr.
Subject | Re: [SQL] Dilema.
Date |
Msg-id | 199912110650.AAA13687@mail.xnet.com
In reply to | Dilema. ("Mitch Vincent" <mitch@venux.net>)
List | pgsql-sql
"Mitch Vincent" wrote:
> I have a problem that I'm hoping someone can help me with. I've built a
> web-based application that is back-ended by PGsql. It works great; I
> couldn't ask for better performance.
>
> However, I just recently coded another module for it, and this module needed
> to accept text files (or just large text blocks, let's say). I ran into the 8k
> query limit barrier. I need these blocks of text to be searchable, and I am
> having to do this with a flat-file based solution right now; it's slow and
> cumbersome.

I realize my stuff does not qualify as anything for Tom Lane to write home about. However, there is a chance that the method I am using fits the purpose you describe (especially considering the quantities you mentioned). The following query is fairly complex, but it takes just a few seconds to complete on our grossly overloaded server:

http://wit.mcs.anl.gov/EMP/select_emp_advanced.cgi?E1.ec_code=ec&E1.ec_code.op=%7E%09is+in+range&E1.ec_code.patt=3.4&ec_code.count=1&T1.text=or&T1.text.op=%7E*%09matches+regex.%2C+case-insensitive&T1.text.patt=rattus&T2.text=ocn&T2.text.op=%7E*%09matches+regex.%2C+case-insensitive&T2.text.patt=rat&T3.text=en&T3.text.op=%7E*%09matches+regex.%2C+case-insensitive&T3.text.patt=protease&text.count=1&N1.seg=km&N1.seg.op=%7E%09contained+in&N1.seg.patt=0+..+10&seg.count=1&constraint=E1+%26+%28T1+%7C+T2%29+%26+T3+%26+N1&do=Run+the+query

It would have taken at least as long to query any individual term with Glimpse, and it would not be feasible to query things like E1 and N1 with anything but postgres, because those queries are based on special data types. Also, I can imagine some difficulty in doing all but trivial relational operations with your average full-text engine.

The idea is simple:

1. Do not store the original files in postgres, or, if you are desperate, store them as large objects.
The only win with large objects is that they can be passed to the server and back through a protocol supported by all postgres clients, but in exchange you take on the headache of maintaining the large objects, which do need some care. Note that there is no win at all if your application is already web-based by design.

A feasible alternative to both large objects and external file storage is to use compression plus uu or base64 encoding, which lets you split your files into nice-looking numbered 61- or 76-character lines. Postgres can handle an awesome lot of those. Check out the MIME::Base64 and Compress::Bzip2 modules if you are building your clients in perl.

2. Having decided how to store and transfer the files, split them up into words, load those into postgres, and build indices on them. Word indices can be quite efficient because you can use ^-anchored expressions with them. Also, parts of the data in your text can be numeric, and it would be natural to index those as numbers. Furthermore, if there are portions of your text that represent distinct classes of data, it would be equally natural to store those in separate index tables. There are about 300 classes of data in the database I am talking about, which is derived from published research articles. The following page roughly outlines how I did it:

http://wit.mcs.anl.gov/EMP/indexing.html

This is not yet a full-text index, but it would be trivial to make one based on such a paradigm. It won't allow you to search for arbitrary substrings, but if you record the sequence in addition to the absolute location of words, you will be able to search for "phrases" of the AltaVista sort, or even use proximity criteria like they do.

--Gene
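The compress-and-encode idea in point 1 can be sketched as follows. This is only an illustration, written in Python's standard library (`bz2` and `base64`) rather than the Perl MIME::Base64 and Compress::Bzip2 modules the post mentions; the function names and the 76-character width are my own choices, not anything from the original setup.

```python
import base64
import bz2
import textwrap


def encode_for_storage(text, width=76):
    """Compress text, base64-encode it, and split it into
    fixed-width lines suitable for storing as ordinary rows."""
    compressed = bz2.compress(text.encode("utf-8"))
    b64 = base64.b64encode(compressed).decode("ascii")
    return textwrap.wrap(b64, width)


def decode_from_storage(lines):
    """Reverse the above: join the lines, decode, decompress."""
    compressed = base64.b64decode("".join(lines))
    return bz2.decompress(compressed).decode("utf-8")


doc = "some large block of searchable text " * 500
lines = encode_for_storage(doc)
restored = decode_from_storage(lines)
```

Each element of `lines` is at most 76 characters, so the file round-trips through the database as plain short text rows, with no large-object machinery involved.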
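The word-index idea in point 2 can also be sketched. The snippet below is a hypothetical Python tokenizer, not the author's actual indexing code (that is outlined at the indexing.html URL above): it turns a document into (doc_id, position, word) rows that could be loaded into a word-index table, and shows why recording the position enables the "phrase" searches mentioned at the end. The table layout and a ^-anchored prefix pattern are assumptions for illustration.

```python
import re


def index_words(doc_id, text):
    """Yield (doc_id, position, word) rows for a hypothetical
    word-index table; position enables phrase/proximity search."""
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        yield (doc_id, pos, word)


rows = list(index_words(1, "Rat protease studies in Rattus norvegicus"))

# A ^-anchored pattern can be answered efficiently from an index
# on the word column, since it reduces to a prefix scan:
prefix = re.compile(r"^prote")
hits = [(pos, w) for (_, pos, w) in rows if prefix.match(w)]

# Adjacent positions reconstruct a phrase: "rattus norvegicus"
# occurs iff the two words appear at positions p and p+1.
positions = {w: pos for (_, pos, w) in rows}
is_phrase = positions.get("norvegicus") == positions.get("rattus", -2) + 1
```

Loading the rows is then an ordinary bulk insert (e.g. one `INSERT` per tuple, or a `COPY`), and the anchored-regex trick is exactly what makes the word table usable through an index rather than a sequential scan.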