Re: Tsearch2 Initial Search Speed

Поиск
Список
Период
Сортировка
Искать

Re: Tsearch2 Initial Search Speed

От:
Alan Hodgson <ahodgson@simkin.ca>
Дата:
On Monday 16 June 2008, Howard Cole  wrote:
> Hi, I am looking to improve the initial query speed for the following
> query:
>
> select email_id from email, to_tsquery('default','example') as q where
> q@@fts;
>
> This is running on 8.2.4 on Windows Server 2K3.
>
> The initial output from explain analyse is as follows.
>
> "Nested Loop  (cost=8.45..76.70 rows=18 width=8) (actual
> time=5776.347..27364.248 rows=14938 loops=1)"
> "  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual
> time=0.023..0.024 rows=1 loops=1)"
> "  ->  Bitmap Heap Scan on email  (cost=8.45..76.46 rows=18 width=322)
> (actual time=5776.314..27353.344 rows=14938 loops=1)"
> "        Filter: (q.q @@ email.fts)"
> "        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..8.44
> rows=18 width=0) (actual time=5763.355..5763.355 rows=15118 loops=1)"
> "              Index Cond: (q.q @@ email.fts)"
> "Total runtime: 27369.091 ms"
>
> Subsequent output is considerably faster. (I am guessing that is because
> email_fts_index is cached.

It's because everything is cached, in particular the relevant rows from 
the "email" table (accessing which took 22 of the original 27 seconds).

The plan looks good for what it's doing.

I don't see that query getting much faster unless you could add a lot more 
cache RAM; 30K random IOs off disk is going to take a fair bit of time 
regardless of what you do. 

-- 
Alan

Re: Tsearch2 Initial Search Speed

От:
Alan Hodgson <ahodgson@simkin.ca>
Дата:
On Tuesday 17 June 2008, Howard Cole  wrote:
> This misses out the random access of the email table, turning my 27
> second query into 6 seconds.

It took less time because it retrieved a lot less data - it still has to 
look at the table.

-- 
Alan

Re: Tsearch2 Initial Search Speed

От:
Alan Hodgson <ahodgson@simkin.ca>
Дата:
On Tuesday 17 June 2008, Howard Cole  wrote:
> Incidentally, how can I clear the cache in between queries?

Stop PostgreSQL, unmount the filesystem it's on, remount it, restart 
PostgreSQL. Works under Linux.

If it's on a filesystem you can't unmount hot, you'll need to reboot.

-- 
Alan

Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:
Hi, I am looking to improve the initial query speed for the following query:

select email_id from email, to_tsquery('default','example') as q where 
q@@fts;

This is running on 8.2.4 on Windows Server 2K3.

The initial output from explain analyse is as follows.

"Nested Loop  (cost=8.45..76.70 rows=18 width=8) (actual 
time=5776.347..27364.248 rows=14938 loops=1)"
"  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual 
time=0.023..0.024 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on email  (cost=8.45..76.46 rows=18 width=322) 
(actual time=5776.314..27353.344 rows=14938 loops=1)"
"        Filter: (q.q @@ email.fts)"
"        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..8.44 
rows=18 width=0) (actual time=5763.355..5763.355 rows=15118 loops=1)"
"              Index Cond: (q.q @@ email.fts)"
"Total runtime: 27369.091 ms"

Subsequent output is considerably faster. (I am guessing that is because 
email_fts_index is cached.

"Nested Loop  (cost=8.45..76.70 rows=18 width=8) (actual 
time=29.241..264.712 rows=14938 loops=1)"
"  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual 
time=0.008..0.010 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on email  (cost=8.45..76.46 rows=18 width=322) 
(actual time=29.224..256.135 rows=14938 loops=1)"
"        Filter: (q.q @@ email.fts)"
"        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..8.44 
rows=18 width=0) (actual time=28.344..28.344 rows=15118 loops=1)"
"              Index Cond: (q.q @@ email.fts)"
"Total runtime: 268.663 ms"

The table contains text derived from emails and therefore its contents 
and the searches can vary wildly.

Table construction as follows:

CREATE TABLE email
(
  email_id bigint NOT NULL DEFAULT 
nextval(('public.email_email_id_seq'::text)::regclass),
  send_to text NOT NULL DEFAULT ''::text,
  reply_from character varying(100) NOT NULL DEFAULT ''::character varying,
  cc text NOT NULL DEFAULT ''::text,
  bcc text NOT NULL DEFAULT ''::text,
  subject text NOT NULL DEFAULT ''::text,
  "content" text NOT NULL DEFAULT ''::text,
  time_tx_rx timestamp without time zone NOT NULL DEFAULT now(),
  fts tsvector,
  CONSTRAINT email_pkey PRIMARY KEY (email_id),
)
WITH (OIDS=FALSE);

-- Index: email_fts_index

CREATE INDEX email_fts_index
  ON email
  USING gist
  (fts);

CREATE INDEX email_mailbox_id_idx
  ON email
  USING btree
  (mailbox_id);


-- Trigger: fts_trigger on email
CREATE TRIGGER fts_trigger
  BEFORE INSERT OR UPDATE
  ON email
  FOR EACH ROW
  EXECUTE PROCEDURE tsearch2('fts', 'send_to', 'reply_from', 'cc', 
'content', 'subject');

Re: Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:
Alan Hodgson wrote:
> It's because everything is cached, in particular the relevant rows from 
> the "email" table (accessing which took 22 of the original 27 seconds).
>
> The plan looks good for what it's doing.
>
> I don't see that query getting much faster unless you could add a lot more 
> cache RAM; 30K random IOs off disk is going to take a fair bit of time 
> regardless of what you do. 
>
>   

Thanks Alan, I guessed that the caching was the difference, but I do not 
understand why there is a heap scan on the email table? The query seems 
to use the email_fts_index correctly, which only takes 6 seconds, why 
does it then need to scan the email table?

Sorry If I sound a bit stupid - I am not very experienced with the 
analyse statement.

Re: Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:
I think I may have answered my own question partially, the problem may 
be how I structure the query.

I always structured my tsearch queries as follows following my initial 
read of the tsearch2 instructions...

select email_id from email, to_tsquery('default', 'howard') as q where 
q@@fts;

However if I construct them in the following way, as stipulated in the 
8.3 documentation....

select email_id from email where fts@@to_tsquery('default','howard')

Then the results are better due to the fact that the email table is not 
necessarily scanned as can be seen from the two analyse statements:

Original statement:

"Nested Loop  (cost=4.40..65.08 rows=16 width=8)"
"  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32)"
"  ->  Bitmap Heap Scan on email  (cost=4.40..64.87 rows=16 width=489)"
"        Filter: (email.fts @@ q.q)"
"        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 
rows=16 width=0)"
"              Index Cond: (email.fts @@ q.q)"

Second statement:

"Bitmap Heap Scan on email  (cost=4.40..64.91 rows=16 width=8)"
"  Filter: (fts @@ '''howard'''::tsquery)"
"  ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 
width=0)"
"        Index Cond: (fts @@ '''howard'''::tsquery)"

This misses out the random access of the email table, turning my 27 
second query into 6 seconds.

I guess the construction of the first statement effectively stops the 
query optimisation from working.

Re: Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:

> As far as I can see, that shouldn't make any difference. Both queries 
> still do the bitmap heap scan, and have almost exactly the same cost.
>
> Matthew
>
You may have a point there Matthew, they both appear to do a scan on the 
email table (Why?). But for whatever reason, I swear the second method 
is significantly faster! If I run the new style query first, then the 
original style (to_tsquery as q) then the original style still takes 
longer, even with the new style cached!

Incidentally, how can I clear the cache in between queries?

Re: Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:
Matthew Wakeling wrote:
> On Tue, 17 Jun 2008, Howard Cole wrote:
>> They both appear to do a scan on the email table (Why?).
>
> The indexes don't contain copies of the row data. They only contain 
> pointers to the rows in the table. So once the index has been 
> consulted, Postgres still needs to look at the table to fetch the 
> actual rows. Of course, it only needs to bother looking where the 
> index points, and that is the benefit of an index.
>
> Matthew
>
Thanks for your patience with me here Matthew, But what I don't 
understand is why it needs to do a scan on email. If I do a query that 
uses another index, then it uses the index only and does not scan the 
email table. The scan on the fts index takes 6 seconds, which presumably 
returns email_id's (the email_id being the primary key) - what does it 
then need from the email table that takes 22 seconds?

e.g.

triohq=> explain select email_id from email where email_directory_id=1;
                                         QUERY PLAN

--------------------------------------------------------------------------------
-------------
 Index Scan using email_email_directory_id_idx on email  
(cost=0.00..129.01 rows
=35 width=8)
   Index Cond: (email_directory_id = 1)
(2 rows)

Re: Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:

> Actually, the index returns page numbers in the table on disc which 
> may contain one or more rows that are relevant. Postgres has to fetch 
> the whole row to find out the email_id and any other information, 
> including whether the row is visible in your current transaction 
> (concurrency control complicates it all). Just having a page number 
> isn't much use to you!
>
> Matthew
>
I learn something new every day.

Thanks Matthew.

Re: Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:

>
> Actually, the index returns page numbers in the table on disc which 
> may contain one or more rows that are relevant. Postgres has to fetch 
> the whole row to find out the email_id and any other information, 
> including whether the row is visible in your current transaction 
> (concurrency control complicates it all). Just having a page number 
> isn't much use to you!
>
> Matthew
>
Out of interest, if I could create a multicolumn index with both the 
primary key and the fts key (I don't think I can create a multi-column 
index using GIST with both the email_id and the fts field), would this 
reduce access to the table due to the primary key being part of the index?

More importantly, are there other ways that I can improve performance on 
this? I am guessing that a lot of the problem is that the email table is 
so big. If I cut out some of the text fields that are not needed in the 
search and put them in another table, presumably the size of the table 
will be reduced to a point where it will reduce the number of disk hits 
and speed the query up.

So I could split the table into two parts:

create table email_part2 (
email_id int8 references email_part1 (email_id),
fts ...,
email_directory_id ...,
)

create table email_part1(
email_id serial8 primary key,
cc text,
bcc text,
...
)

and the query will be
select email_id from email_part2 where to_tsquery('default', 'howard') 
@@ fts;

Re: Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:

> Good idea. Note that Postgres is already doing this to some extent 
> with TOAST - read 
> http://www.postgresql.org/docs/8.3/interactive/storage-toast.html - 
> unfortunately, there doesn't seem to be an option to always move 
> particular columns out to TOAST. Your idea will produce an even 
> smaller table. However, are email_ids all that you want from the query?
>
> Matthew
>
As you point out - I will need more then the email_ids in the query, but 
if I remove just the content, to, cc fields then the size of the table 
should shrink dramatically. Just remains to be seen if the TOAST has 
already done that optimisation for me.

Again. Thanks Matthew - I owe you a beer.

Re: Tsearch2 Initial Search Speed

От:
Howard Cole <howardnews@selestial.com>
Дата:
PFC wrote:
>> Hi, I am looking to improve the initial query speed for the following 
>> query:
>
>     Try Xapian full text search engine, it behaves much better than 
> tsearch when the dataset exceeds your memory cache size.
>
> __________ Information from ESET NOD32 Antivirus, version of virus 
> signature database 3202 (20080620) __________
>
> The message was checked by ESET NOD32 Antivirus.
>
> http://www.eset.com
>
>
>
Thanks for the heads-up PFC, but I prefer the tsearch2 license.

Re: Tsearch2 Initial Search Speed

От:
Matthew Wakeling <matthew@flymine.org>
Дата:
On Tue, 17 Jun 2008, Howard Cole wrote:
> Alan Hodgson wrote:
>> It's because everything is cached, in particular the relevant rows from the 
>> "email" table (accessing which took 22 of the original 27 seconds).

> Thanks Alan, I guessed that the caching was the difference, but I do not 
> understand why there is a heap scan on the email table? The query seems to 
> use the email_fts_index correctly, which only takes 6 seconds, why does it 
> then need to scan the email table?

It's not a sequential scan - that really would take a fair time. It's a 
bitmap heap scan - that is, it has built a bitmap of the rows needed by 
using the index, and now it needs to fetch all those rows from the email 
table. There's 14938 of them, and they're likely scattered all over the 
table, so you'll probably have to do 14938 seeks on the disc. At 5ms a 
pop, that would be 70 seconds, so count yourself lucky it only takes 22 
seconds instead!

If you aren't actually interested in having all 14938 rows splurged at 
you, try using the LIMIT keyword at the end of the query. That would make 
it run a bit faster, and would make sense if you only want to display the 
first twenty on a web page or something.

Matthew

-- 
For every complex problem, there is a solution that is simple, neat, and wrong.
                                                      -- H. L. Mencken 

Re: Tsearch2 Initial Search Speed

От:
Matthew Wakeling <matthew@flymine.org>
Дата:
On Tue, 17 Jun 2008, Howard Cole wrote:
> I think I may have answered my own question partially, the problem may be how 
> I structure the query.
>
> Original statement:
>
> "Nested Loop  (cost=4.40..65.08 rows=16 width=8)"
> "  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32)"
> "  ->  Bitmap Heap Scan on email  (cost=4.40..64.87 rows=16 width=489)"
> "        Filter: (email.fts @@ q.q)"
> "        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 width=0)"
> "              Index Cond: (email.fts @@ q.q)"
>
> Second statement:
>
> "Bitmap Heap Scan on email  (cost=4.40..64.91 rows=16 width=8)"
> "  Filter: (fts @@ '''howard'''::tsquery)"
> "  ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 width=0)"
> "        Index Cond: (fts @@ '''howard'''::tsquery)"

As far as I can see, that shouldn't make any difference. Both queries 
still do the bitmap heap scan, and have almost exactly the same cost.

Matthew

-- 
Lord grant me patience, and I want it NOW!

Re: Tsearch2 Initial Search Speed

От:
Matthew Wakeling <matthew@flymine.org>
Дата:
On Tue, 17 Jun 2008, Howard Cole wrote:
> They both appear to do a scan on the email table (Why?).

The indexes don't contain copies of the row data. They only contain 
pointers to the rows in the table. So once the index has been consulted, 
Postgres still needs to look at the table to fetch the actual rows. Of 
course, it only needs to bother looking where the index points, and that 
is the benefit of an index.

Matthew

-- 
I've run DOOM more in the last few days than I have the last few
months.  I just love debugging ;-)  -- Linus Torvalds

Re: Tsearch2 Initial Search Speed

От:
Matthew Wakeling <matthew@flymine.org>
Дата:
On Tue, 17 Jun 2008, Howard Cole wrote:
> If I do a query that uses another index, then it uses the index only and 
> does not scan the email table.

Not true. It only looks a little bit like that from the explain output. 
However, if you look closely:

> Index Scan using email_email_directory_id_idx on email  (cost=0.00..129.01 rows=35 width=8)
>  Index Cond: (email_directory_id = 1)
> (2 rows)

It's a scan *using* the index, but *on* the table "email". This index scan 
is having to read the email table too.

> The scan on the fts index takes 6 seconds, which presumably returns 
> email_id's (the email_id being the primary key) - what does it then need 
> from the email table that takes 22 seconds?

Actually, the index returns page numbers in the table on disc which may 
contain one or more rows that are relevant. Postgres has to fetch the 
whole row to find out the email_id and any other information, including 
whether the row is visible in your current transaction (concurrency 
control complicates it all). Just having a page number isn't much use to 
you!

Matthew

-- 
First law of computing:  Anything can go wro
sig: Segmentation fault.  core dumped.

Re: Tsearch2 Initial Search Speed

От:
Matthew Wakeling <matthew@flymine.org>
Дата:
On Tue, 17 Jun 2008, Alan Hodgson wrote:
> On Tuesday 17 June 2008, Howard Cole  wrote:
>> Incidentally, how can I clear the cache in between queries?
>
> Stop PostgreSQL, unmount the filesystem it's on, remount it, restart
> PostgreSQL. Works under Linux.
>
> If it's on a filesystem you can't unmount hot, you'll need to reboot.

Not true - on recent Linux kernels, you can drop the OS cache by running

echo "1" >/proc/sys/vm/drop_caches

as root. You'll still need to restart Postgres to drop its cache too.

Matthew

-- 
Richards' Laws of Data Security:
 1. Don't buy a computer.
 2. If you must buy a computer, don't turn it on.

Re: Tsearch2 Initial Search Speed

От:
Matthew Wakeling <matthew@flymine.org>
Дата:
On Wed, 18 Jun 2008, Howard Cole wrote:
> Out of interest, if I could create a multicolumn index with both the primary 
> key and the fts key (I don't think I can create a multi-column index using 
> GIST with both the email_id and the fts field), would this reduce access to 
> the table due to the primary key being part of the index?

Unfortunately not, since the indexes do not contain information on whether 
a particular row is visible in your current transaction. Like I said, 
concurrency control really complicates things!

> More importantly, are there other ways that I can improve performance on 
> this? I am guessing that a lot of the problem is that the email table is so 
> big. If I cut out some of the text fields that are not needed in the search 
> and put them in another table, presumably the size of the table will be 
> reduced to a point where it will reduce the number of disk hits and speed the 
> query up.

Good idea. Note that Postgres is already doing this to some extent with 
TOAST - read 
http://www.postgresql.org/docs/8.3/interactive/storage-toast.html - 
unfortunately, there doesn't seem to be an option to always move 
particular columns out to TOAST. Your idea will produce an even smaller 
table. However, are email_ids all that you want from the query?

Matthew

-- 
Okay, I'm weird! But I'm saving up to be eccentric.
FAQ