Обсуждение: Tsearch2 Initial Search Speed

Поиск
Список
Период
Сортировка

Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
Hi, I am looking to improve the initial query speed for the following query:

select email_id from email, to_tsquery('default','example') as q where
q@@fts;

This is running on 8.2.4 on Windows Server 2K3.

The initial output from explain analyse is as follows.

"Nested Loop  (cost=8.45..76.70 rows=18 width=8) (actual
time=5776.347..27364.248 rows=14938 loops=1)"
"  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual
time=0.023..0.024 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on email  (cost=8.45..76.46 rows=18 width=322)
(actual time=5776.314..27353.344 rows=14938 loops=1)"
"        Filter: (q.q @@ email.fts)"
"        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..8.44
rows=18 width=0) (actual time=5763.355..5763.355 rows=15118 loops=1)"
"              Index Cond: (q.q @@ email.fts)"
"Total runtime: 27369.091 ms"

Subsequent output is considerably faster. (I am guessing that is because
email_fts_index is cached.

"Nested Loop  (cost=8.45..76.70 rows=18 width=8) (actual
time=29.241..264.712 rows=14938 loops=1)"
"  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual
time=0.008..0.010 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on email  (cost=8.45..76.46 rows=18 width=322)
(actual time=29.224..256.135 rows=14938 loops=1)"
"        Filter: (q.q @@ email.fts)"
"        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..8.44
rows=18 width=0) (actual time=28.344..28.344 rows=15118 loops=1)"
"              Index Cond: (q.q @@ email.fts)"
"Total runtime: 268.663 ms"

The table contains text derived from emails and therefore its contents
and the searches can vary wildly.

Table construction as follows:

CREATE TABLE email
(
  email_id bigint NOT NULL DEFAULT
nextval(('public.email_email_id_seq'::text)::regclass),
  send_to text NOT NULL DEFAULT ''::text,
  reply_from character varying(100) NOT NULL DEFAULT ''::character varying,
  cc text NOT NULL DEFAULT ''::text,
  bcc text NOT NULL DEFAULT ''::text,
  subject text NOT NULL DEFAULT ''::text,
  "content" text NOT NULL DEFAULT ''::text,
  time_tx_rx timestamp without time zone NOT NULL DEFAULT now(),
  fts tsvector,
  CONSTRAINT email_pkey PRIMARY KEY (email_id),
)
WITH (OIDS=FALSE);

-- Index: email_fts_index

CREATE INDEX email_fts_index
  ON email
  USING gist
  (fts);

CREATE INDEX email_mailbox_id_idx
  ON email
  USING btree
  (mailbox_id);


-- Trigger: fts_trigger on email
CREATE TRIGGER fts_trigger
  BEFORE INSERT OR UPDATE
  ON email
  FOR EACH ROW
  EXECUTE PROCEDURE tsearch2('fts', 'send_to', 'reply_from', 'cc',
'content', 'subject');


Re: Tsearch2 Initial Search Speed

От
Alan Hodgson
Дата:
On Monday 16 June 2008, Howard Cole <howardnews@selestial.com> wrote:
> Hi, I am looking to improve the initial query speed for the following
> query:
>
> select email_id from email, to_tsquery('default','example') as q where
> q@@fts;
>
> This is running on 8.2.4 on Windows Server 2K3.
>
> The initial output from explain analyse is as follows.
>
> "Nested Loop  (cost=8.45..76.70 rows=18 width=8) (actual
> time=5776.347..27364.248 rows=14938 loops=1)"
> "  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual
> time=0.023..0.024 rows=1 loops=1)"
> "  ->  Bitmap Heap Scan on email  (cost=8.45..76.46 rows=18 width=322)
> (actual time=5776.314..27353.344 rows=14938 loops=1)"
> "        Filter: (q.q @@ email.fts)"
> "        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..8.44
> rows=18 width=0) (actual time=5763.355..5763.355 rows=15118 loops=1)"
> "              Index Cond: (q.q @@ email.fts)"
> "Total runtime: 27369.091 ms"
>
> Subsequent output is considerably faster. (I am guessing that is because
> email_fts_index is cached.

It's because everything is cached, in particular the relevant rows from
the "email" table (accessing which took 22 of the original 27 seconds).

The plan looks good for what it's doing.

I don't see that query getting much faster unless you could add a lot more
cache RAM; 30K random IOs off disk is going to take a fair bit of time
regardless of what you do.

--
Alan

Re: Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
Alan Hodgson wrote:
> It's because everything is cached, in particular the relevant rows from
> the "email" table (accessing which took 22 of the original 27 seconds).
>
> The plan looks good for what it's doing.
>
> I don't see that query getting much faster unless you could add a lot more
> cache RAM; 30K random IOs off disk is going to take a fair bit of time
> regardless of what you do.
>
>

Thanks Alan, I guessed that the caching was the difference, but I do not
understand why there is a heap scan on the email table? The query seems
to use the email_fts_index correctly, which only takes 6 seconds, why
does it then need to scan the email table?

Sorry If I sound a bit stupid - I am not very experienced with the
analyse statement.

Re: Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
I think I may have answered my own question partially, the problem may
be how I structure the query.

I always structured my tsearch queries as follows following my initial
read of the tsearch2 instructions...

select email_id from email, to_tsquery('default', 'howard') as q where
q@@fts;

However if I construct them in the following way, as stipulated in the
8.3 documentation....

select email_id from email where fts@@to_tsquery('default','howard')

Then the results are better due to the fact that the email table is not
necessarily scanned as can be seen from the two analyse statements:

Original statement:

"Nested Loop  (cost=4.40..65.08 rows=16 width=8)"
"  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32)"
"  ->  Bitmap Heap Scan on email  (cost=4.40..64.87 rows=16 width=489)"
"        Filter: (email.fts @@ q.q)"
"        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40
rows=16 width=0)"
"              Index Cond: (email.fts @@ q.q)"

Second statement:

"Bitmap Heap Scan on email  (cost=4.40..64.91 rows=16 width=8)"
"  Filter: (fts @@ '''howard'''::tsquery)"
"  ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16
width=0)"
"        Index Cond: (fts @@ '''howard'''::tsquery)"

This misses out the random access of the email table, turning my 27
second query into 6 seconds.

I guess the construction of the first statement effectively stops the
query optimisation from working.


Re: Tsearch2 Initial Search Speed

От
Matthew Wakeling
Дата:
On Tue, 17 Jun 2008, Howard Cole wrote:
> Alan Hodgson wrote:
>> It's because everything is cached, in particular the relevant rows from the
>> "email" table (accessing which took 22 of the original 27 seconds).

> Thanks Alan, I guessed that the caching was the difference, but I do not
> understand why there is a heap scan on the email table? The query seems to
> use the email_fts_index correctly, which only takes 6 seconds, why does it
> then need to scan the email table?

It's not a sequential scan - that really would take a fair time. It's a
bitmap heap scan - that is, it has built a bitmap of the rows needed by
using the index, and now it needs to fetch all those rows from the email
table. There's 14938 of them, and they're likely scattered all over the
table, so you'll probably have to do 14938 seeks on the disc. At 5ms a
pop, that would be 70 seconds, so count yourself lucky it only takes 22
seconds instead!

If you aren't actually interested in having all 14938 rows splurged at
you, try using the LIMIT keyword at the end of the query. That would make
it run a bit faster, and would make sense if you only want to display the
first twenty on a web page or something.

Matthew

--
For every complex problem, there is a solution that is simple, neat, and wrong.
                                                      -- H. L. Mencken

Re: Tsearch2 Initial Search Speed

От
Matthew Wakeling
Дата:
On Tue, 17 Jun 2008, Howard Cole wrote:
> I think I may have answered my own question partially, the problem may be how
> I structure the query.
>
> Original statement:
>
> "Nested Loop  (cost=4.40..65.08 rows=16 width=8)"
> "  ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32)"
> "  ->  Bitmap Heap Scan on email  (cost=4.40..64.87 rows=16 width=489)"
> "        Filter: (email.fts @@ q.q)"
> "        ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 width=0)"
> "              Index Cond: (email.fts @@ q.q)"
>
> Second statement:
>
> "Bitmap Heap Scan on email  (cost=4.40..64.91 rows=16 width=8)"
> "  Filter: (fts @@ '''howard'''::tsquery)"
> "  ->  Bitmap Index Scan on email_fts_index  (cost=0.00..4.40 rows=16 width=0)"
> "        Index Cond: (fts @@ '''howard'''::tsquery)"

As far as I can see, that shouldn't make any difference. Both queries
still do the bitmap heap scan, and have almost exactly the same cost.

Matthew

--
Lord grant me patience, and I want it NOW!

Re: Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
> As far as I can see, that shouldn't make any difference. Both queries
> still do the bitmap heap scan, and have almost exactly the same cost.
>
> Matthew
>
You may have a point there Matthew, they both appear to do a scan on the
email table (Why?). But for whatever reason, I swear the second method
is significantly faster! If I run the new style query first, then the
original style (to_tsquery as q) then the original style still takes
longer, even with the new style cached!

Incidentally, how can I clear the cache in between queries?

Re: Tsearch2 Initial Search Speed

От
Matthew Wakeling
Дата:
On Tue, 17 Jun 2008, Howard Cole wrote:
> They both appear to do a scan on the email table (Why?).

The indexes don't contain copies of the row data. They only contain
pointers to the rows in the table. So once the index has been consulted,
Postgres still needs to look at the table to fetch the actual rows. Of
course, it only needs to bother looking where the index points, and that
is the benefit of an index.

Matthew

--
I've run DOOM more in the last few days than I have the last few
months.  I just love debugging ;-)  -- Linus Torvalds

Re: Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
Matthew Wakeling wrote:
> On Tue, 17 Jun 2008, Howard Cole wrote:
>> They both appear to do a scan on the email table (Why?).
>
> The indexes don't contain copies of the row data. They only contain
> pointers to the rows in the table. So once the index has been
> consulted, Postgres still needs to look at the table to fetch the
> actual rows. Of course, it only needs to bother looking where the
> index points, and that is the benefit of an index.
>
> Matthew
>
Thanks for your patience with me here Matthew, But what I don't
understand is why it needs to do a scan on email. If I do a query that
uses another index, then it uses the index only and does not scan the
email table. The scan on the fts index takes 6 seconds, which presumably
returns email_id's (the email_id being the primary key) - what does it
then need from the email table that takes 22 seconds?

e.g.

triohq=> explain select email_id from email where email_directory_id=1;
                                         QUERY PLAN

--------------------------------------------------------------------------------
-------------
 Index Scan using email_email_directory_id_idx on email
(cost=0.00..129.01 rows
=35 width=8)
   Index Cond: (email_directory_id = 1)
(2 rows)

Re: Tsearch2 Initial Search Speed

От
Matthew Wakeling
Дата:
On Tue, 17 Jun 2008, Howard Cole wrote:
> If I do a query that uses another index, then it uses the index only and
> does not scan the email table.

Not true. It only looks a little bit like that from the explain output.
However, if you look closely:

> Index Scan using email_email_directory_id_idx on email  (cost=0.00..129.01 rows=35 width=8)
>  Index Cond: (email_directory_id = 1)
> (2 rows)

It's a scan *using* the index, but *on* the table "email". This index scan
is having to read the email table too.

> The scan on the fts index takes 6 seconds, which presumably returns
> email_id's (the email_id being the primary key) - what does it then need
> from the email table that takes 22 seconds?

Actually, the index returns page numbers in the table on disc which may
contain one or more rows that are relevant. Postgres has to fetch the
whole row to find out the email_id and any other information, including
whether the row is visible in your current transaction (concurrency
control complicates it all). Just having a page number isn't much use to
you!

Matthew

--
First law of computing:  Anything can go wro
sig: Segmentation fault.  core dumped.

Re: Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
> Actually, the index returns page numbers in the table on disc which
> may contain one or more rows that are relevant. Postgres has to fetch
> the whole row to find out the email_id and any other information,
> including whether the row is visible in your current transaction
> (concurrency control complicates it all). Just having a page number
> isn't much use to you!
>
> Matthew
>
I learn something new every day.

Thanks Matthew.

Re: Tsearch2 Initial Search Speed

От
Alan Hodgson
Дата:
On Tuesday 17 June 2008, Howard Cole <howardnews@selestial.com> wrote:
> This misses out the random access of the email table, turning my 27
> second query into 6 seconds.

It took less time because it retrieved a lot less data - it still has to
look at the table.

--
Alan

Re: Tsearch2 Initial Search Speed

От
Alan Hodgson
Дата:
On Tuesday 17 June 2008, Howard Cole <howardnews@selestial.com> wrote:
> Incidentally, how can I clear the cache in between queries?

Stop PostgreSQL, unmount the filesystem it's on, remount it, restart
PostgreSQL. Works under Linux.

If it's on a filesystem you can't unmount hot, you'll need to reboot.

--
Alan

Re: Tsearch2 Initial Search Speed

От
Matthew Wakeling
Дата:
On Tue, 17 Jun 2008, Alan Hodgson wrote:
> On Tuesday 17 June 2008, Howard Cole <howardnews@selestial.com> wrote:
>> Incidentally, how can I clear the cache in between queries?
>
> Stop PostgreSQL, unmount the filesystem it's on, remount it, restart
> PostgreSQL. Works under Linux.
>
> If it's on a filesystem you can't unmount hot, you'll need to reboot.

Not true - on recent Linux kernels, you can drop the OS cache by running

echo "1" >/proc/sys/vm/drop_caches

as root. You'll still need to restart Postgres to drop its cache too.

Matthew

--
Richards' Laws of Data Security:
 1. Don't buy a computer.
 2. If you must buy a computer, don't turn it on.

Re: Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
>
> Actually, the index returns page numbers in the table on disc which
> may contain one or more rows that are relevant. Postgres has to fetch
> the whole row to find out the email_id and any other information,
> including whether the row is visible in your current transaction
> (concurrency control complicates it all). Just having a page number
> isn't much use to you!
>
> Matthew
>
Out of interest, if I could create a multicolumn index with both the
primary key and the fts key (I don't think I can create a multi-column
index using GIST with both the email_id and the fts field), would this
reduce access to the table due to the primary key being part of the index?

More importantly, are there other ways that I can improve performance on
this? I am guessing that a lot of the problem is that the email table is
so big. If I cut out some of the text fields that are not needed in the
search and put them in another table, presumably the size of the table
will be reduced to a point where it will reduce the number of disk hits
and speed the query up.

So I could split the table into two parts:

create table email_part2 (
email_id int8 references email_part1 (email_id),
fts ...,
email_directory_id ...,
)

create table email_part1(
email_id serial8 primary key,
cc text,
bcc text,
...
)

and the query will be
select email_id from email_part2 where to_tsquery('default', 'howard')
@@ fts;

Re: Tsearch2 Initial Search Speed

От
Matthew Wakeling
Дата:
On Wed, 18 Jun 2008, Howard Cole wrote:
> Out of interest, if I could create a multicolumn index with both the primary
> key and the fts key (I don't think I can create a multi-column index using
> GIST with both the email_id and the fts field), would this reduce access to
> the table due to the primary key being part of the index?

Unfortunately not, since the indexes do not contain information on whether
a particular row is visible in your current transaction. Like I said,
concurrency control really complicates things!

> More importantly, are there other ways that I can improve performance on
> this? I am guessing that a lot of the problem is that the email table is so
> big. If I cut out some of the text fields that are not needed in the search
> and put them in another table, presumably the size of the table will be
> reduced to a point where it will reduce the number of disk hits and speed the
> query up.

Good idea. Note that Postgres is already doing this to some extent with
TOAST - read
http://www.postgresql.org/docs/8.3/interactive/storage-toast.html -
unfortunately, there doesn't seem to be an option to always move
particular columns out to TOAST. Your idea will produce an even smaller
table. However, are email_ids all that you want from the query?

Matthew

--
Okay, I'm weird! But I'm saving up to be eccentric.

Re: Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
> Good idea. Note that Postgres is already doing this to some extent
> with TOAST - read
> http://www.postgresql.org/docs/8.3/interactive/storage-toast.html -
> unfortunately, there doesn't seem to be an option to always move
> particular columns out to TOAST. Your idea will produce an even
> smaller table. However, are email_ids all that you want from the query?
>
> Matthew
>
As you point out - I will need more then the email_ids in the query, but
if I remove just the content, to, cc fields then the size of the table
should shrink dramatically. Just remains to be seen if the TOAST has
already done that optimisation for me.

Again. Thanks Matthew - I owe you a beer.

Re: Tsearch2 Initial Search Speed

От
Howard Cole
Дата:
PFC wrote:
>> Hi, I am looking to improve the initial query speed for the following
>> query:
>
>     Try Xapian full text search engine, it behaves much better than
> tsearch when the dataset exceeds your memory cache size.
>
> __________ Information from ESET NOD32 Antivirus, version of virus
> signature database 3202 (20080620) __________
>
> The message was checked by ESET NOD32 Antivirus.
>
> http://www.eset.com
>
>
>
Thanks for the heads-up PFC, but I prefer the tsearch2 license.