Обсуждение: Postgres query optimization with varchar fields

Поиск
Список
Период
Сортировка

Postgres query optimization with varchar fields

От
"W.H. van Atteveldt"
Дата:
Dear reader,

I am investigating whether it is useful to directly query a database
containing a rather large text corpus (order of magnitude 100k - 1m
newspaper articles, so around 100 million words), or whether I should
use third party text indexing services. I want to know things such as:
how often is a certain word (or pattern) mentioned in an article and how
often it is mentioned with the condition that another word is nearby
(same article or n words distant).

I created a table listing the words one word per row, and created an
index on the word and wordnr columns. An example query would be:

simple: select articleid, count(*) as count from words w where articleid
in (select id from articles where batchid in (84,85,100,101,118,121))
and (word like '<PATTERN>')  group by articleid
complex: select articleid, count(*) as count from words w where
articleid in (select id from articles where batchid in
(84,85,100,101,118,121)) and (word like '<PATTERN>')  and exists (select
* from words w2 where w.articleid = w2.articleid and (word like
'<PATTERN2>')) group by articleid

According to the diagnostics, the database does use the indices for the
query, but it is still rather slow (around 10 minutes for a 'simple
query', x seconds for a complex one)

It is important that the complex query only counts instances where the
PATTERN is found and PATTERN2 only functions as a criterium and does not
add to the count.

My questions are: (technical details provided below)
- Does anyone disagree with the general setup?
- Is there a more sensible way to phrase my SQL?
- Any other ideas to improve performance?

Thanks,

Wouter van Atteveldt
Free University Amsterdam

------

Technicalities:

I am using a Postgresql 7.4.1 database on a linux machine (uname -a:
Linux swpc450.cs.vu.nl 2.4.22-1.2115.nptl #1 Wed Oct 29 15:31:21 EST
2003 i686 athlon i386 GNU/Linux). The table of interest is: (lemma, pos,
simplepos currently not used)

                                    Table "public.words"
   Column   |          Type          |                       Modifiers
------------+------------------------+----------------------------------
---------------------
 id         | integer                | not null default
nextval('public.words_id_seq'::text)
 articleid  | integer                | not null
 sentencenr | integer                | not null
 word       | character varying(255) | not null
 lemma      | character varying(255) |
 pos        | character varying(255) |
 simplepos  | character(1)           |
 wordnr     | integer                | not null
 parnr      | integer                | not null
Indexes:
    "words_pkey" primary key, btree (id)
    "words_aid" btree (articleid)
    "words_word" btree (word)
    "words_word_ptrn" btree (word varchar_pattern_ops)
    "words_wordnr" btree (wordnr)

Query plans:

anoko=> explain select articleid, count(*) as count from words w where
articleid in (select id from articles where batchid in
(84,85,100,101,118,121)) and (word like 'integratie%')  group by
articleid;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------
 HashAggregate  (cost=937959.21..937959.22 rows=2 width=4)
   ->  Hash IN Join  (cost=95863.70..937816.01 rows=28640 width=4)
         Hash Cond: ("outer".articleid = "inner".id)
         ->  Index Scan using words_word_ptrn on words w
(cost=0.00..836604.62 rows=208886 width=4)
               Index Cond: (((word)::text ~>=~ 'integratie'::character
varying) AND ((word)::text ~<~ 'integratif'::character varying))
               Filter: ((word)::text ~~ 'integratie%'::text)
         ->  Hash  (cost=94998.60..94998.60 rows=146041 width=4)
               ->  Index Scan using articles_batchid, articles_batchid,
articles_batchid, articles_batchid, articles_batchid, articles_batchid
on articles  (cost=0.00..94998.60 rows=146041 width=4)
                     Index Cond: ((batchid = 84) OR (batchid = 85) OR
(batchid = 100) OR (batchid = 101) OR (batchid = 118) OR (batchid =
121))

explain select articleid, count(*) as count from words w where articleid
in (select id from articles where batchid in (84,85,100,101,118,121))
and (word like '<PATTERN>')  and exists (select * from words w2 where
w.articleid = w2.articleid and (word like '<PATTERN2>')) group by
articleid
anoko-> ;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------
 GroupAggregate  (cost=168253089.23..168254556.46 rows=1 width=4)
   ->  Merge IN Join  (cost=168253089.23..168254484.85 rows=14320
width=4)
         Merge Cond: ("outer".articleid = "inner".id)
         ->  Sort  (cost=168144438.23..168144699.33 rows=104443 width=4)
               Sort Key: w.articleid
               ->  Index Scan using words_word_ptrn on words w
(cost=0.00..168134972.17 rows=104443 width=4)
                     Index Cond: ((word)::text ~=~
'<PATTERN>'::character varying)
                     Filter: (((word)::text ~~ '<PATTERN>'::text) AND
(subplan))
                     SubPlan
                       ->  Index Scan using words_aid on words w2
(cost=0.00..836948.84 rows=1045 width=460)
                             Index Cond: ($0 = articleid)
                             Filter: ((word)::text ~~
'<PATTERN2>'::text)
         ->  Sort  (cost=108651.01..109016.11 rows=146041 width=4)
               Sort Key: articles.id
               ->  Index Scan using articles_batchid, articles_batchid,
articles_batchid, articles_batchid, articles_batchid, articles_batchid
on articles  (cost=0.00..94998.60 rows=146041 width=4)
                     Index Cond: ((batchid = 84) OR (batchid = 85) OR
(batchid = 100) OR (batchid = 101) OR (batchid = 118) OR (batchid =
121))

Re: Postgres query optimization with varchar fields

От
Christopher Kings-Lynne
Дата:
> I am investigating whether it is useful to directly query a database
> containing a rather large text corpus (order of magnitude 100k - 1m
> newspaper articles, so around 100 million words), or whether I should
> use third party text indexing services. I want to know things such as:
> how often is a certain word (or pattern) mentioned in an article and how
> often it is mentioned with the condition that another word is nearby
> (same article or n words distant).

You really want to use the contrib/tsearch2 module that comes already
with PostgreSQL.

cd contrib/tsearch2
gmake install
psql <mydb> < tsearch2.sql
more README.tsearch2

Chris