Discussion: speeding up a query on a large table

speeding up a query on a large table

From:
Kevin Murphy
Date: Mon, 25 Jul 2005 17:50:55 -0400
I'm trying to speed up a query on a text column of a 14M-row table.
Uncached query times vary between 1 and 20 seconds (maybe more),
depending on the search term.  Between time trials I've been trying to
flush the disk buffer cache by selecting count(*) from a separate 4GB
table, and times are pretty consistent for a given search term.  If a
search term hasn't been seen in a while, or my full table scan has
purged memory, a query may take 20 seconds, whereas if the relevant
pages are cached, it may take 8.5 milliseconds.  I'd really like to
avoid that 20-second turn-off for users.
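
For reference, the timing procedure between cache flushes is roughly
the following (the name of the separate 4GB flush table is made up here):

-- flush most of the cached data by scanning an unrelated large table
select count(*) from big_flush_table;   -- hypothetical name for the separate 4GB table

-- then time the query of interest from psql
\timing
select * from tagged_genes where mention = 'bcl2';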

I'm guessing I need to put lots of RAM in this machine (currently only
2GB; max 8GB) and somehow pre-load the entire index?  The production
version of this database will be read-only, btw.  Because this index
will be key to the whole application, and because the number of
possible search terms is so large, it would be nice if the entire index
could somehow be preloaded into memory and encouraged to stay there.
Alternatively, we could prime the disk buffers by doing searches on
what we guess will be the most common terms.  I wasn't paying attention
to the recent thread about RAM disks, but maybe this is a situation
that might call for one?

The rest of this message contains details about the situation.

Thanks for the advice, as usual!

-Kevin Murphy

Sample query:

explain analyze select * from tagged_genes where mention = 'bcl2';
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Index Scan using tagged_genes_mention_idx on tagged_genes
     (cost=0.00..327.64 rows=81 width=67) (actual time=28.694..5544.779 rows=848 loops=1)
   Index Cond: (mention = 'bcl2'::text)
 Total runtime: 5545.434 ms

The index of the query column (mention) is 226,229 pages (= 1.7 GB?).
There are 1.3M unique values in that column.  I've run 'vacuum analyze'
on the table.  I also tried setting the statistics target to 1000, and
it did speed up searches for some search terms.
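
For the record, the statistics target change would be something like:

alter table tagged_genes alter column mention set statistics 1000;
analyze tagged_genes (mention);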

Just out of curiosity, and because I am interested in possibly using
tsearch2 in the future, I created a tsvector column indexing the
mention column (and added a GiST index and vacuum-analyzed the table
again).  tsearch2 is a lot slower, presumably because it's doing a lot
more (although in this case, it doesn't return all that much more).
Here is a typical result of a tsearch2 search on my data:

explain analyze select * from tagged_genes where vector @@ to_tsquery('bcl2');

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Index Scan using vector_idx on tagged_genes
     (cost=0.00..56092.98 rows=13983 width=67) (actual time=202.078..43122.688 rows=980 loops=1)
   Index Cond: (vector @@ '\'bcl2\''::tsquery)
 Total runtime: 43124.215 ms

Blech.  I'd love to use tsearch2, but it's hard to see it being useful
with my puny hardware and not so puny data.
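
For completeness, the tsvector column and GiST index were set up more
or less like this (paraphrasing; the tsearch2 configuration name may
differ on other installs):

alter table tagged_genes add column vector tsvector;
update tagged_genes set vector = to_tsvector('default', mention);
create index tg_vector_idx on tagged_genes using gist(vector);
vacuum analyze tagged_genes;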

I'm using PG 8.0.3 on Mac OS X 10.4.2 on a dual 2.5GHz G5, currently
with 2GB RAM.  The data is physically stored on an Xserve RAID array
(seven 7200rpm ultra-ATA drives, RAID 5) connected via fibre channel
directly to the G5.  I did some tests of this array's performance for a
single user.  Using a large block size (256K), this array can do
sequential reads at 134 MB/sec, but the performance drops to 12 MB/sec
for 4K sequential reads.  Random 4K reads are 5 MB/sec, and random 256K
reads are 121 MB/sec (all these according to "XBench" - not sure if
it's a good benchmark).

I'm using these shared memory settings:

kern.sysv.shmmax=134217728
kern.sysv.shmall=32768
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8

and from postgresql.conf:

shared_buffers = 15000
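# i.e. 15000 x 8 kB pages = ~117 MB of shared buffers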
work_mem = 32768

Everything else in postgresql.conf is default.

-------

Here is the table description:

tagged_genes table:
13,982,464 rows
422,028 pages (although about half of that is the experimental
tsvector column)

create table tagged_genes (
        id           bigint NOT NULL PRIMARY KEY,  -- artificial primary key
        mention      text,             -- a gene name or description
        pmid         bigint,           -- identifies the document that
                                       -- the mention occurs in
        create_date  timestamp NOT NULL,
        vector       tsvector          -- experimental tsearch2 index of
                                       -- the mention column
);
create index tg_mention_idx on tagged_genes(mention);
create index tg_pmid_idx    on tagged_genes(pmid);
create index tg_vector_idx  on tagged_genes using gist(vector);

========================================

Some trials with different search terms:

mycn:
 Index Scan using tagged_genes_mention_idx on tagged_genes
     (cost=0.00..327.64 rows=81 width=67) (actual time=41.703..2751.600 rows=479 loops=1)
   Index Cond: (mention = 'mycn'::text)
 Total runtime: 2751.936 ms

mycn trials: 2752 ms, 2755 ms, 2766 ms

bcl2:
 Index Scan using tagged_genes_mention_idx on tagged_genes
     (cost=0.00..327.64 rows=81 width=67) (actual time=28.694..5544.779 rows=848 loops=1)
   Index Cond: (mention = 'bcl2'::text)
 Total runtime: 5545.434 ms

bcl2 trials: 5545 ms, 5492 ms, 5493 ms

cyp3a4:
 Index Scan using tagged_genes_mention_idx on tagged_genes
     (cost=0.00..7867.60 rows=1958 width=67) (actual time=58.138..9602.558 rows=1985 loops=1)
   Index Cond: (mention = 'cyp3a4'::text)
 Total runtime: 9603.733 ms

cyp3a4 trials: 9604 ms, 11872 ms, 9970 ms

tp53:
 Index Scan using tagged_genes_mention_idx on tagged_genes
     (cost=0.00..327.64 rows=81 width=67) (actual time=28.505..8064.808 rows=1484 loops=1)
   Index Cond: (mention = 'tp53'::text)
 Total runtime: 8065.791 ms

P.S.  I've heard that OS X sucks as a server OS.  I don't know if
that's true, but I'm interested in installing a PPC Linux and giving
that a go for the sake of comparison.  I don't know where that would
leave my storage hardware, though.


Re: speeding up a query on a large table

From:
Manfred Koizar
Date: 17 Aug 2005
On Mon, 25 Jul 2005 17:50:55 -0400, Kevin Murphy
<murphy@genome.chop.edu> wrote:
> and because the number of possible search terms is so large, it
>would be nice if the entire index could somehow be preloaded into memory
>and encouraged to stay there.

Postgres does not have such a feature, and I wouldn't recommend messing
around inside Postgres.  You could try to copy the relevant index
file(s) to /dev/null to populate the OS cache ...
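
For example, something like this, where the path and the angle-bracket
placeholders are illustrative (the actual file is named after the
index's relfilenode):

cat $PGDATA/base/<database-oid>/<relfilenode>* > /dev/null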

>There are 1.3M unique values in that column.

That'd mean that each value occurs 10 times on average.  In your tests
the planner consistently estimates 81, and the real numbers are even
higher.  Can this be explained by the nature of the data distribution?

>and from postgresql.conf:
>shared_buffers = 15000
>work_mem = 32768
>Everything else in postgresql.conf is default.

Setting effective_cache_size to a sane value wouldn't hurt.  I don't
know about OS X; does it, like Linux, automatically tune its disk
cache or do you have to configure it somewhere?
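
For example, with 2GB of RAM and little else running on the box, a
rough starting point might be the following (in 8.0 the value is
measured in 8 kB pages):

effective_cache_size = 150000    # ~1.2 GB; roughly the RAM the OS can use for disk cache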

>tagged_genes table:
>13,982,464 rows
>422,028 pages (although about half of that is the experimental
>tsvector column)
>The index of the query column (mention) is 226,229 pages (= 1.7 GB?).

The average tuples per page ratio seems a bit low, both for the heap
(~33) and for the index (~62).  If the planner's tuple size estimation
of 67 bytes is approximately right, there's a lot of free space in
your relations.  Try VACUUM FULL and REINDEX or CLUSTER to shrink
these files.
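
Concretely, one of the following; CLUSTER rewrites the table in index
order and rebuilds all of its indexes, so it replaces the other two:

VACUUM FULL tagged_genes;
REINDEX TABLE tagged_genes;

-- or:
CLUSTER tg_mention_idx ON tagged_genes;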

>create table tagged_genes (
>        id           bigint NOT NULL PRIMARY KEY,  -- artificial primary key
>        mention      text,             -- a gene name or description
>        pmid         bigint,           -- identifies the document that
>                                       -- the mention occurs in
>        create_date  timestamp NOT NULL,
>        vector       tsvector          -- experimental tsearch2 index of
>                                       -- the mention column
>);
>create index tg_mention_idx on tagged_genes(mention);
>create index tg_pmid_idx    on tagged_genes(pmid);
>create index tg_vector_idx  on tagged_genes using gist(vector);

If mention is long (which is not implied by your examples, but an int
is still smaller than any nonempty text) and there are many
duplicates, it might pay off to put them in their own table:

CREATE TABLE mentions (
    id      SERIAL PRIMARY KEY,
    mention text UNIQUE,
    vector  tsvector          -- experimental tsearch2 index
) WITHOUT oids;
create index me_vector_idx  on mentions using gist(vector);

and reference them from tagged_genes:

create table tagged_genes (
        id           bigint NOT NULL PRIMARY KEY,
        mentionid    int REFERENCES mentions,
        pmid         bigint,           -- identifies the document that
                                       -- the mention occurs in
        create_date  timestamp NOT NULL
) WITHOUT oids;

Unless my math is wrong, this would result in a heap size of ~120K
pages and an index size of ~52K pages, plus some 10% slack for updated
and deleted tuples, if you VACUUM regularly.
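
The original lookup would then become a join along these lines:

SELECT tg.*
  FROM tagged_genes tg
  JOIN mentions m ON m.id = tg.mentionid
 WHERE m.mention = 'bcl2';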

Servus
 Manfred


Re: speeding up a query on a large table

From:
Mike Rylander
Date:
On 8/17/05, Manfred Koizar <mkoi-pg@aon.at> wrote:
> On Mon, 25 Jul 2005 17:50:55 -0400, Kevin Murphy
> <murphy@genome.chop.edu> wrote:
> > and because the number of possible search terms is so large, it
> >would be nice if the entire index could somehow be preloaded into memory
> >and encouraged to stay there.
>
> Postgres does not have such a feature and I wouldn't recommend to mess
> around inside Postgres.  You could try to copy the relevant index
> file(s) to /dev/null to populate the OS cache ...

That actually works fine.  When I had big problems with a large GiST
index I just used cat to dump it to /dev/null and the OS grabbed it.
Of course, that was on Linux, so YMMV.

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Re: speeding up a query on a large table

From:
Kevin Murphy
Date:
Mike Rylander wrote:

>On 8/17/05, Manfred Koizar <mkoi-pg@aon.at> wrote:
>
>>On Mon, 25 Jul 2005 17:50:55 -0400, Kevin Murphy
>><murphy@genome.chop.edu> wrote:
>>
>>>and because the number of possible search terms is so large, it
>>>would be nice if the entire index could somehow be preloaded into memory
>>>and encouraged to stay there.
>>
>>You could try to copy the relevant index
>>file(s) to /dev/null to populate the OS cache ...
>
>That actually works fine.  When I had big problems with a large GiST
>index I just used cat to dump it to /dev/null and the OS grabbed it.
>Of course, that was on Linux, so YMMV.
>
Thanks, Manfred & Mike.  That is a very nice solution.  And just for
the sake of the archive ... I can find the file name(s) of the relevant
index or table by looking up pg_class.relfilenode where
pg_class.relname is the name of the entity, then doing, e.g.: sudo -u
postgres find /usr/local/pgsql/data -name "somerelfilenode*".
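
In other words, roughly (index name and paths as in my install; the
<relfilenode> placeholder stands for the number returned by the query):

-- in psql: look up the on-disk file name of the index
SELECT relfilenode FROM pg_class WHERE relname = 'tg_mention_idx';

# in a shell: locate the file(s) and read them to warm the OS cache
sudo -u postgres find /usr/local/pgsql/data -name "<relfilenode>*" -exec cat {} \; > /dev/null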

-Kevin Murphy