Discussion: full-text indexing
Does anyone have any experience using Postgres's full-text indexing in a
production environment? We're thinking about using it for a project (the
other solution is to regexp it with Perl...). I've set up the stuff
before for experimentation, but am mainly curious about its performance
on a live, fairly heavily trafficked server.
Brett W. McCoy
http://www.chapelperilous.net
---------------------------------------------------------------------------
"The chain which can be yanked is not the eternal chain."
-- G. Fitch
> Does anyone have any experience using Postgres's full-text indexing in a
> production environment? We're thinking about using it for a project (the
> other solution is to regexp it with Perl...). I've set up the stuff
> before for experimentation, but am mainly curious about its performance
> on a live, fairly heavily trafficked server.

I have one word for you: CLUSTER. Without it, index lookups are too
slow. With it, they are rapid. I have done some work like this
commercially with Ingres, which has an ISAM type that keeps the matching
rows pretty close on a newly-created ISAM index. In PostgreSQL, a
regular CLUSTER will keep you good.

If you find it slow, let me know. I have done some benchmarking with
the author and he found it pretty fast, usually a few seconds. See the
section in my book on CLUSTER for information on _why_ it helps.

http://www.postgresql.org/docs/awbook.html

--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
At 08:40 PM 18/04/00 -0400, Bruce Momjian wrote:

> I have one word for you: CLUSTER. Without it, index lookups are too
> slow. With it, they are rapid. I have done some work like this
> commercially with Ingres, which has an ISAM type that keeps the matching
> rows pretty close on a newly-created ISAM index. In PostgreSQL, a
> regular CLUSTER will keep you good.

The only downside of CLUSTER is that you lose your other keys on the
table, including the primary key, so you lose your integrity checks,
which you then have to explicitly re-apply with CREATE UNIQUE INDEX ...

--
Kumera - a new Open Source Content Management System for small to medium
web sites, written in Perl and using XML
http://www.cyber4.org/kumera/index.html
> At 08:40 PM 18/04/00 -0400, Bruce Momjian wrote:
>
> > I have one word for you: CLUSTER. Without it, index lookups are too
> > slow. With it, they are rapid. I have done some work like this
> > commercially with Ingres, which has an ISAM type that keeps the matching
> > rows pretty close on a newly-created ISAM index. In PostgreSQL, a
> > regular CLUSTER will keep you good.
>
> The only downside of CLUSTER is that you lose your other keys on the
> table, including the primary key, so you lose your integrity checks,
> which you then have to explicitly re-apply with CREATE UNIQUE INDEX ...

Yes, you basically have to recreate them after CLUSTER.

--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
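[The cluster-then-recreate step discussed above can be sketched as a small helper that emits the SQL; the table name "docs", index name "docs_fti_idx", and column "id" are hypothetical, and this is a sketch of the workflow, not anyone's actual maintenance script.]

```python
# Sketch of the post-CLUSTER cleanup described above.  In PostgreSQL of
# this era, CLUSTER rewrote the table in index order and dropped the
# table's other indexes, so the unique index enforcing the primary key
# had to be re-created afterward.  All names here are hypothetical.

def recluster_sql(table: str, fti_index: str, pk_col: str) -> list[str]:
    """Return the SQL statements for a re-cluster plus index repair."""
    return [
        f"CLUSTER {fti_index} ON {table};",
        f"CREATE UNIQUE INDEX {table}_{pk_col}_key ON {table} ({pk_col});",
    ]

for stmt in recluster_sql("docs", "docs_fti_idx", "id"):
    print(stmt)
```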
On Tue, 18 Apr 2000, Bruce Momjian wrote:
> I have one word for you: CLUSTER. Without it, index lookups are too
> slow. With it, they are rapid. I have done some work like this
> commercially with Ingres, which has an ISAM type that keeps the matching
> rows pretty close on a newly-created ISAM index. In PostgreSQL, a
> regular CLUSTER will keep you good.
I agree! That's the last bit of advice given in the full-text README. As I
said, I'd built full-text stuff for experimentation (I had maybe 30k of
raw text, which amounted to several hundred thousand indexed entries), and I
had clustered it, and it was pretty darn fast, even on a Pentium 233 with
only 48 megs of RAM. I have significantly better hardware to run it on now.
The original project called for MySQL, but it just didn't have what we
needed to put something like this together.
> If you find it slow, let me know. I have done some benchmarking with
> the author and he found it pretty fast, usually a few seconds. See the
> section in my book on CLUSTER for information on _why_ it helps.
Thanks, Bruce.
Brett W. McCoy
http://www.chapelperilous.net
---------------------------------------------------------------------------
Twenty two thousand days.
Twenty two thousand days.
It's not a lot.
It's all you've got.
Twenty two thousand days.
-- Moody Blues, "Twenty Two Thousand Days"
> On Tue, 18 Apr 2000, Bruce Momjian wrote:
>
> > I have one word for you: CLUSTER. Without it, index lookups are too
> > slow. With it, they are rapid. I have done some work like this
> > commercially with Ingres, which has an ISAM type that keeps the matching
> > rows pretty close on a newly-created ISAM index. In PostgreSQL, a
> > regular CLUSTER will keep you good.
>
> I agree! That's the last bit of advice given in the full-text README. As I
> said, I'd built full-text stuff for experimentation (I had maybe 30k of
> raw text, which amounted to several hundred thousand indexed entries), and
> I had clustered it, and it was pretty darn fast, even on a Pentium 233
> with only 48 megs of RAM. I have significantly better hardware to run it
> on now. The original project called for MySQL, but it just didn't have
> what we needed to put something like this together.

With the original author, testing was fast, but when he loaded all the
data, it got very slow. The problem was that as soon as his data
exceeded the buffer cache, performance became terrible.

--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Tue, 18 Apr 2000, Bruce Momjian wrote:
> > I agree! That's the last bit of advice given in the full-text README. As I
> > said, I'd built full-text stuff for experimentation (I had maybe 30k of
> > raw text, which amounted to several hundred thousand indexed entries), and
> > I had clustered it, and it was pretty darn fast, even on a Pentium 233
> > with only 48 megs of RAM. I have significantly better hardware to run it
> > on now. The original project called for MySQL, but it just didn't have
> > what we needed to put something like this together.
>
> With the original author, testing was fast, but when he loaded all the
> data, it got very slow. The problem was that as soon as his data
> exceeded the buffer cache, performance became terrible.
How much data are we talking here? How can one get around this buffer
cache problem?
Brett W. McCoy
http://www.chapelperilous.net
---------------------------------------------------------------------------
Twenty two thousand days.
Twenty two thousand days.
It's not a lot.
It's all you've got.
Twenty two thousand days.
-- Moody Blues, "Twenty Two Thousand Days"
> On Tue, 18 Apr 2000, Bruce Momjian wrote:
>
> > > I agree! That's the last bit of advice given in the full-text README.
> > > As I said, I'd built full-text stuff for experimentation (I had maybe
> > > 30k of raw text, which amounted to several hundred thousand indexed
> > > entries), and I had clustered it, and it was pretty darn fast, even on
> > > a Pentium 233 with only 48 megs of RAM. I have significantly better
> > > hardware to run it on now. The original project called for MySQL, but
> > > it just didn't have what we needed to put something like this together.
> >
> > With the original author, testing was fast, but when he loaded all the
> > data, it got very slow. The problem was that as soon as his data
> > exceeded the buffer cache, performance became terrible.
>
> How much data are we talking here? How can one get around this buffer
> cache problem?

You would have to fit _all_ your heap data into the PostgreSQL buffer
cache. That is a lot of shared memory. If it were that small, you
wouldn't need full-text indexing. :-)

--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
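[As a rough illustration of "that is a lot of shared memory": the thread later mentions a full-text table of about 5 million rows. Treating each fti row as roughly 40 bytes (a guessed figure for a short text fragment plus tuple overhead) and assuming the old 64-buffer default cache, the back-of-envelope arithmetic looks like this.]

```python
# Back-of-envelope for "fit all your heap data into the buffer cache".
# The 5-million-row figure comes from later in this thread; the 40 bytes
# per row and the 64-buffer default are assumptions for illustration.
PAGE = 8 * 1024              # PostgreSQL page size: 8 KB
DEFAULT_BUFFERS = 64         # assumed default buffer count of the era
ROW_BYTES = 40               # rough guess per fti row
FTI_ROWS = 5_000_000

heap_bytes = FTI_ROWS * ROW_BYTES
cache_bytes = DEFAULT_BUFFERS * PAGE

print(heap_bytes // (1024 * 1024))   # ~190 MB of heap data to cache
print(cache_bytes // 1024)           # vs. 512 KB of default buffer cache
```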
On Tue, 18 Apr 2000, Bruce Momjian wrote:
> Let me be specific. The problem is that without cluster, your fragment
> rows are together in the index, but are all over the heap table, so you
> have to read in all those disk buffers, and that is slow. With cluster,
> most of your matching fragments are on the same disk page, so one access
> gets them all.
>
> The nightly CLUSTER is a pain, but the only way I ever got it working
> quickly.
This is pretty painless compared to what I've had to do getting big
full-text systems like Excalibur to work. I can deal with a nightly
cluster.
Brett W. McCoy
http://www.chapelperilous.net
---------------------------------------------------------------------------
Twenty two thousand days.
Twenty two thousand days.
It's not a lot.
It's all you've got.
Twenty two thousand days.
-- Moody Blues, "Twenty Two Thousand Days"
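[The locality effect Bruce describes, matching rows gathered onto a few heap pages instead of scattered everywhere, can be shown with a toy model; the row and page sizes below are illustrative assumptions, not measurements from this thread.]

```python
# Toy model of why CLUSTER helps: count how many 8 KB heap pages must be
# read to fetch 100 matching rows from a large table, first with matches
# scattered across the heap, then with them stored contiguously (as after
# CLUSTER).  All sizes are illustrative assumptions.
import random

ROWS_PER_PAGE = 50           # rough rows per 8 KB page
TABLE_ROWS = 1_000_000
MATCHES = 100

def pages_touched(row_ids):
    """Number of distinct heap pages holding the given rows."""
    return len({r // ROWS_PER_PAGE for r in row_ids})

random.seed(42)
scattered = random.sample(range(TABLE_ROWS), MATCHES)   # before CLUSTER
clustered = range(5000, 5000 + MATCHES)                 # after CLUSTER

print(pages_touched(scattered))   # close to 100: roughly one page read per row
print(pages_touched(clustered))   # only 2 pages hold every matching row
```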
Hi,
I guess that 'original author' would be me....
> > With the original author, testing was fast, but when he loaded all the
> > data, it got very slow. The problem was that as soon as his data
> > exceeded the buffer cache, performance became terrible.
>
> How much data are we talking here? How can one get around this buffer
> cache problem?
This is all from the top of my head, but if I remember correctly, the main
table had about 750,000 rows, of which one varchar(25) field was fti'ed
('full text indexed' :), resulting in some 5 million rows in the fti table.

With regard to file sizes, I don't really remember. If you're really
interested, I can make another setup to check this (over Easter time).

I'm curious: Bruce mentioned buffer cache sizes. What exactly is this
buffer cache used for? I thought we relied on the OS filesystem caching
to cache database files? What will increasing the buffer caches give me?
Maarten
--
Maarten Boekhold, maarten.boekhold@tibcofinance.com
TIBCO Finance Technology Inc.
"Sevilla" Building
Entrada 308
1096 ED Amsterdam, The Netherlands
tel: +31 20 6601000 (direct: +31 20 6601066)
fax: +31 20 6601005
http://www.tibcofinance.com
> Hi,
>
> I guess that 'original author' would be me....
Yes.
>
> > > With the original author, testing was fast, but when he loaded all the
> > > data, it got very slow. The problem was that as soon as his data
> > > exceeded the buffer cache, performance became terrible.
> >
> > How much data are we talking here? How can one get around this buffer
> > cache problem?
>
> This is all from the top of my head, but if I remember correctly, the main
> table had about 750,000 rows, of which one varchar(25) field was fti'ed
> ('full text indexed' :), resulting in some 5 million rows in the fti table.
>
> With regard to file sizes, I don't really remember. If you're really
> interested, I can make another setup to check this (over Easter time).
>
> I'm curious: Bruce mentioned buffer cache sizes. What exactly is this
> buffer cache used for? I thought we relied on the OS filesystem caching
> to cache database files? What will increasing the buffer caches give me?
The PostgreSQL shared buffers are used by the database to read/write 8k
db buffers. The OS has buffers too, so there is some duplication. Ours
exist in shared memory so all backends can use the information and
mark/flush them as needed. Increasing the shared buffer cache will keep
more buffers available, but frankly the OS buffer cache is just as
important, or more so. It is when the stuff is in neither cache and we
have to go to disk thousands of times for one query that things get bad.
--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
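[Since the shared buffers Bruce describes are counted in 8 KB pages, a quick conversion helper shows what a given buffer count costs in shared memory; the counts below are examples for illustration, not recommendations.]

```python
# Translate a shared-buffer count (8 KB pages) into megabytes of shared
# memory.  The example counts are illustrative, not tuning advice.
PAGE_KB = 8

def buffers_to_mb(n_buffers: int) -> float:
    """Shared memory consumed by n_buffers 8 KB shared buffers."""
    return n_buffers * PAGE_KB / 1024

print(buffers_to_mb(64))     # 0.5 MB
print(buffers_to_mb(2048))   # 16.0 MB
```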