Обсуждение: Forcing the right queryplan

Поиск
Список
Период
Сортировка

Forcing the right queryplan

От
Henk van Lingen
Дата:
Hi,

I've the problem my database is not using the 'right' queryplan in all
cases. Is there a way I can force that and/or how should I tuned the
table statistics?

I'm doing a rsyslog database in PostgreSQL with millions of records
(firewall logging). The db scheme is the so called 'MonitorWare' scheme,
to wich I added two extra indexes.

syslog=# select version();
                                                     version

--------------------------------------------------------------------------------
----------------------------------
 PostgreSQL 8.4.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20
080704 (Red Hat 4.1.2-48), 64-bit


syslog=# \d systemevents
                                         Table "public.systemevents"
       Column       |            Type             |                         Modi
fiers
--------------------+-----------------------------+-----------------------------
------------------------------
 id                 | integer                     | not null default nextval('sy
stemevents_id_seq'::regclass)
 customerid         | bigint                      |
 receivedat         | timestamp without time zone |
 devicereportedtime | timestamp without time zone |
 facility           | smallint                    |
 priority           | smallint                    |
 fromhost           | character varying(60)       |
 message            | text                        |
 ntseverity         | integer                     |
 importance         | integer                     |
 eventsource        | character varying(60)       |
 eventuser          | character varying(60)       |
 eventcategory      | integer                     |
 eventid            | integer                     |
 eventbinarydata    | text                        |
 maxavailable       | integer                     |
 currusage          | integer                     |
 minusage           | integer                     |
 maxusage           | integer                     |
 infounitid         | integer                     |
 syslogtag          | character varying(60)       |
 eventlogtype       | character varying(60)       |
 genericfilename    | character varying(60)       |
 systemid           | integer                     |
Indexes:
    "systemevents_pkey" PRIMARY KEY, btree (id)
    "fromhost_idx" btree (fromhost)
    "msgs_idx" gin (to_tsvector('english'::regconfig, message))

The GIN index is to do text searching (via LogAnalyzer).

Now there are two types of query plans:

syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT 100;
                                                 QUERY PLAN                    

--------------------------------------------------------------------------------
---------------------------------
 Limit  (cost=0.00..10177.22 rows=100 width=159)
   ->  Index Scan Backward using systemevents_pkey on systemevents  (cost=0.00..
1052934.86 rows=10346 width=159)
         Filter: (to_tsvector('english'::regconfig, message) @@ to_tsquery('131.
211.112.9'::text))
(3 rows)

This one is useless (takes very long). However this one:

syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT 500; 
                                                    QUERY PLAN

--------------------------------------------------------------------------------
-----------------------------------
 Limit  (cost=40928.89..40930.14 rows=500 width=159)
   ->  Sort  (cost=40928.89..40954.76 rows=10346 width=159)
         Sort Key: id
         ->  Bitmap Heap Scan on systemevents  (cost=2898.06..40413.36 rows=1034
6 width=159)
               Recheck Cond: (to_tsvector('english'::regconfig, message) @@ to_t
squery('131.211.112.9'::text))
               ->  Bitmap Index Scan on msgs_idx  (cost=0.00..2895.47 rows=10346
 width=0)
                     Index Cond: (to_tsvector('english'::regconfig, message) @@
to_tsquery('131.211.112.9'::text))
(7 rows)

works acceptable.

Stats:

syslog=# SELECT relname, relkind, reltuples, relpages
FROM pg_class                                                      WHERE relname LIKE 'systemevents%';
                                      relname            | relkind |  reltuples  | relpages  
-------------------------------+---------+-------------+----------
 systemevents_pkey             | i       | 2.06915e+06 |    71985
 systemeventsproperties        | r       |           0 |        0
 systemeventsproperties_pkey   | i       |           0 |        1
 systemevents_id_seq           | S       |           1 |        1
 systemeventsproperties_id_seq | S       |           1 |        1
 systemevents                  | r       | 2.06915e+06 |   694826
(6 rows)

syslog=# SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'msg%';
 relname  | relkind |  reltuples  | relpages
----------+---------+-------------+----------
 msgs_idx | i       | 2.06915e+06 |   128069
(1 row)

How to use the right plan regardless of the 'LIMIT-size'?

Cheers,
--
Henk van Lingen, ICT-SC Netwerk & Telefonie,                  (o-      -+
Universiteit Utrecht, Jenalaan 18a, room 0.12                 /\        |
phone: +31-30-2538453                                         v_/_      |
http://henk.vanlingen.net/             http://www.tuxtown.net/netiquette/

Re: Forcing the right queryplan

От
Henk van Lingen
Дата:
No ideas on this one?

Regards,

On Tue, Aug 31, 2010 at 04:50:09PM +0200, Henk van Lingen wrote:
  >
  > Hi,
  >
  > I've the problem my database is not using the 'right' queryplan in all
  > cases. Is there a way I can force that and/or how should I tuned the
  > table statistics?
  >
  > I'm doing a rsyslog database in PostgreSQL with millions of records
  > (firewall logging). The db scheme is the so called 'MonitorWare' scheme,
  > to wich I added two extra indexes.
  >
  > syslog=# select version();
  >                                                      version
  >
  > --------------------------------------------------------------------------------
  > ----------------------------------
  >  PostgreSQL 8.4.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20
  > 080704 (Red Hat 4.1.2-48), 64-bit
  >
  >
  > syslog=# \d systemevents
  >                                          Table "public.systemevents"
  >        Column       |            Type             |                         Modi
  > fiers
  > --------------------+-----------------------------+-----------------------------
  > ------------------------------
  >  id                 | integer                     | not null default nextval('sy
  > stemevents_id_seq'::regclass)
  >  customerid         | bigint                      |
  >  receivedat         | timestamp without time zone |
  >  devicereportedtime | timestamp without time zone |
  >  facility           | smallint                    |
  >  priority           | smallint                    |
  >  fromhost           | character varying(60)       |
  >  message            | text                        |
  >  ntseverity         | integer                     |
  >  importance         | integer                     |
  >  eventsource        | character varying(60)       |
  >  eventuser          | character varying(60)       |
  >  eventcategory      | integer                     |
  >  eventid            | integer                     |
  >  eventbinarydata    | text                        |
  >  maxavailable       | integer                     |
  >  currusage          | integer                     |
  >  minusage           | integer                     |
  >  maxusage           | integer                     |
  >  infounitid         | integer                     |
  >  syslogtag          | character varying(60)       |
  >  eventlogtype       | character varying(60)       |
  >  genericfilename    | character varying(60)       |
  >  systemid           | integer                     |
  > Indexes:
  >     "systemevents_pkey" PRIMARY KEY, btree (id)
  >     "fromhost_idx" btree (fromhost)
  >     "msgs_idx" gin (to_tsvector('english'::regconfig, message))
  >
  > The GIN index is to do text searching (via LogAnalyzer).
  >
  > Now there are two types of query plans:
  >
  > syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT 100;
                                                 QUERY PLAN                    
  >
  > --------------------------------------------------------------------------------
  > ---------------------------------
  >  Limit  (cost=0.00..10177.22 rows=100 width=159)
  >    ->  Index Scan Backward using systemevents_pkey on systemevents  (cost=0.00..
  > 1052934.86 rows=10346 width=159)
  >          Filter: (to_tsvector('english'::regconfig, message) @@ to_tsquery('131.
  > 211.112.9'::text))
  > (3 rows)
  >
  > This one is useless (takes very long). However this one:
  >
  > syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT 500; 
  >                                                     QUERY PLAN
  >
  > --------------------------------------------------------------------------------
  > -----------------------------------
  >  Limit  (cost=40928.89..40930.14 rows=500 width=159)
  >    ->  Sort  (cost=40928.89..40954.76 rows=10346 width=159)
  >          Sort Key: id
  >          ->  Bitmap Heap Scan on systemevents  (cost=2898.06..40413.36 rows=1034
  > 6 width=159)
  >                Recheck Cond: (to_tsvector('english'::regconfig, message) @@ to_t
  > squery('131.211.112.9'::text))
  >                ->  Bitmap Index Scan on msgs_idx  (cost=0.00..2895.47 rows=10346
  >  width=0)
  >                      Index Cond: (to_tsvector('english'::regconfig, message) @@
  > to_tsquery('131.211.112.9'::text))
  > (7 rows)
  >
  > works acceptable.
  >
  > Stats:
  >
  > syslog=# SELECT relname, relkind, reltuples, relpages
  > FROM pg_class                                                      WHERE relname LIKE 'systemevents%';
                                          relname            | relkind |  reltuples  | relpages  
  > -------------------------------+---------+-------------+----------
  >  systemevents_pkey             | i       | 2.06915e+06 |    71985
  >  systemeventsproperties        | r       |           0 |        0
  >  systemeventsproperties_pkey   | i       |           0 |        1
  >  systemevents_id_seq           | S       |           1 |        1
  >  systemeventsproperties_id_seq | S       |           1 |        1
  >  systemevents                  | r       | 2.06915e+06 |   694826
  > (6 rows)
  >
  > syslog=# SELECT relname, relkind, reltuples, relpages
  > FROM pg_class
  > WHERE relname LIKE 'msg%';
  >  relname  | relkind |  reltuples  | relpages
  > ----------+---------+-------------+----------
  >  msgs_idx | i       | 2.06915e+06 |   128069
  > (1 row)
  >
  > How to use the right plan regardless of the 'LIMIT-size'?
  >
  > Cheers,
  > --
  > Henk van Lingen, ICT-SC Netwerk & Telefonie,                  (o-      -+
  > Universiteit Utrecht, Jenalaan 18a, room 0.12                 /\        |
  > phone: +31-30-2538453                                         v_/_      |
  > http://henk.vanlingen.net/             http://www.tuxtown.net/netiquette/
  >
  > --
  > Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
  > To make changes to your subscription:
  > http://www.postgresql.org/mailpref/pgsql-general

--
Henk van Lingen, ICT-SC Netwerk & Telefonie,                  (o-      -+
Universiteit Utrecht, Jenalaan 18a, room 0.12                 /\        |
phone: +31-30-2538453                                         v_/_      |
http://henk.vanlingen.net/             http://www.tuxtown.net/netiquette/

Re: Forcing the right queryplan

От
Yeb Havinga
Дата:
Henk van Lingen wrote:
> Now there are two types of query plans:
>
> syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT 100;
                                                 QUERY PLAN                    
>
>  Limit  (cost=0.00..10177.22 rows=100 width=159)
>    ->  Index Scan Backward using systemevents_pkey on systemevents  (cost=0.00..
> 1052934.86 rows=10346 width=159)
>          Filter: (to_tsvector('english'::regconfig, message) @@ to_tsquery('131.
> 211.112.9'::text))
> (3 rows)
>
> This one is useless (takes very long). However this one:
>
Hello Henk,

I saw your other mail today, I'm replying on this one for better formatting.

With a limit of 100 the planner guesses it will find 100 matching rows
within some cost. At 500 rows the cost is higher than that of the second
plan:
> syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT 500; 
>                                                     QUERY PLAN
>
> --------------------------------------------------------------------------------
> -----------------------------------
>  Limit  (cost=40928.89..40930.14 rows=500 width=159)
>    ->  Sort  (cost=40928.89..40954.76 rows=10346 width=159)
>          Sort Key: id
>          ->  Bitmap Heap Scan on systemevents  (cost=2898.06..40413.36 rows=1034
> 6 width=159)
>                Recheck Cond: (to_tsvector('english'::regconfig, message) @@ to_t
> squery('131.211.112.9'::text))
>                ->  Bitmap Index Scan on msgs_idx  (cost=0.00..2895.47 rows=10346
>  width=0)
>                      Index Cond: (to_tsvector('english'::regconfig, message) @@
> to_tsquery('131.211.112.9'::text))
> (7 rows)
>
> works acceptable.
>
> How to use the right plan regardless of the 'LIMIT-size'?
>
The planner obviously thinks it will have read 100 rows from
systemevents backwards earlier than it actually does, with the where
clause that contains the scanning for string 131.211.112.9. Increasing
the stats target in this case will probably not help, since the
statistics will not contain selectivity for all possible ts queries.

If the index is useless anyway, you might consider dropping it.
Otherwise, increasing random_page_cost might help in choosing the
otherplan, but on the other hand that plan has index scanning too, so
I'm not to sure there.

If that doesn't help, it would be interesting to see some output of
vmstat 1 (or better: iostat -xk 1) to see what is the bottleneck during
execution of the first plan. If it is IO bound, you might want to
increase RAM or add spindles for increased random io performance. If it
is CPU bound, it is probably because of executing the to_tsvector
function. In that case it might be interesting to see if changing
ts_vectors cost (see ALTER FUNCTION ... COST .../
http://developer.postgresql.org/pgdocs/postgres/sql-alterfunction.html)
again helps the planner to favor the second plan over the first.

regards,
Yeb Havinga


Re: Forcing the right queryplan

От
Henk van Lingen
Дата:
On Fri, Sep 03, 2010 at 09:20:39AM +0200, Yeb Havinga wrote:
  >
  > If the index is useless anyway, you might consider dropping it.
  > Otherwise, increasing random_page_cost might help in choosing the
  > otherplan, but on the other hand that plan has index scanning too,
  > so I'm not to sure there.
  >
  > If that doesn't help, it would be interesting to see some output
  > of vmstat 1 (or better: iostat -xk 1) to see what is the
  > bottleneck during execution of the first plan. If it is IO bound,
  > you might want to increase RAM or add spindles for increased
  > random io performance. If it is CPU bound, it is probably because
  > of executing the to_tsvector function. In that case it might be
  > interesting to see if changing ts_vectors cost (see ALTER FUNCTION

Hi Yeb,

Thanks for your answer. Dropping the (pkey) index is not an option.
iostat suggest the thing is CPU bound (%iowait remaining 11% but cpu
rizing from 1 to 13 %)
However, I'm reluctant to changing the to_tsvector costs. (besides
not knowing how the find out the current value). The pkey is also
used for queries like this one, which also results in the wrong
queryplan:

syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( 'error')) )  AND id <= 26689837 ; 
                                          QUERY PLAN

--------------------------------------------------------------------------------
---------------
 Index Scan using systemevents_pkey on systemevents  (cost=0.00..27302.74 rows=2
174 width=158)
   Index Cond: (id <= 26689837)
   Filter: (to_tsvector('english'::regconfig, message) @@ to_tsquery('error'::te
xt))
(3 rows)

So I'm afraid that not being able to force a plan is a showstopper for
using postgresql with full text search for this project.

Regards,
--
Henk van Lingen, ICT-SC Netwerk & Telefonie,                  (o-      -+
Universiteit Utrecht, Jenalaan 18a, room 0.12                 /\        |
phone: +31-30-2538453                                         v_/_      |
http://henk.vanlingen.net/             http://www.tuxtown.net/netiquette/

Re: Forcing the right queryplan

От
Alban Hertroys
Дата:
Sorry for not replying earlier, I've been quite busy.

On 31 Aug 2010, at 16:50, Henk van Lingen wrote:

> syslog=# \d systemevents
>                                         Table "public.systemevents"
>       Column       |            Type             |                         Modi
> fiers
> --------------------+-----------------------------+-----------------------------
> ------------------------------
> id                 | integer                     | not null default nextval('sy

(...)

> message            | text                        |

(...)

> Indexes:
>    "systemevents_pkey" PRIMARY KEY, btree (id)
>    "fromhost_idx" btree (fromhost)
>    "msgs_idx" gin (to_tsvector('english'::regconfig, message))
>
> The GIN index is to do text searching (via LogAnalyzer).
>
> Now there are two types of query plans:

Do you have output of explain analyse for these queries as well? It's hard to see what is actually going on with just
theexplain - we can't see which part of the query is more expensive than the planner expected, for starters. 

> syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT 100;
                                                 QUERY PLAN                    
>
> --------------------------------------------------------------------------------
> ---------------------------------
> Limit  (cost=0.00..10177.22 rows=100 width=159)
>   ->  Index Scan Backward using systemevents_pkey on systemevents  (cost=0.00..
> 1052934.86 rows=10346 width=159)
>         Filter: (to_tsvector('english'::regconfig, message) @@ to_tsquery('131.
> 211.112.9'::text))
> (3 rows)
>
> This one is useless (takes very long). However this one:
>
> syslog=# explain SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message FROM
systemeventsWHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT 500; 
>                                                    QUERY PLAN
>
> --------------------------------------------------------------------------------
> -----------------------------------
> Limit  (cost=40928.89..40930.14 rows=500 width=159)
>   ->  Sort  (cost=40928.89..40954.76 rows=10346 width=159)
>         Sort Key: id
>         ->  Bitmap Heap Scan on systemevents  (cost=2898.06..40413.36 rows=1034
> 6 width=159)
>               Recheck Cond: (to_tsvector('english'::regconfig, message) @@ to_t
> squery('131.211.112.9'::text))
>               ->  Bitmap Index Scan on msgs_idx  (cost=0.00..2895.47 rows=10346
> width=0)
>                     Index Cond: (to_tsvector('english'::regconfig, message) @@
> to_tsquery('131.211.112.9'::text))
> (7 rows)
>
> works acceptable.

Odd that more records and a more complicated plan gives faster results... That's why I think we'd really want to see
explainanalyse output. 
I'm guessing that there are a lot of records matching your search string and that you've found the cut-off point where
theplanner thinks you're throwing away enough rows that it's not very useful to first select all the matching records
beforesorting the results. 

I think it decided to just start searching backwards along the id and returning the rows that match that IP (and are
visibleto your transaction) would be faster than trying to work with all the rows that match that IP. 

This probably means it misjudged the costs of sorting your index backwards, which indicates that your planning
statisticsare off, or that your cost parameters aren't appropriate for your system. 



One thing I do notice is that the first plan uses the index on id instead of the ts_vector one. For queries like those
youcould try to use a combined index like this: 

CREATE INDEX msgs_idx (to_tsvector('english'::regconfig, message), id) ON systemevents USING (gin);

Whether to use gist or gin there I really don't know. I'd probably go for gist, I seem to recall that gin is fairly
heavyto use. 

> How to use the right plan regardless of the 'LIMIT-size'?


You could try turning off planner-options, but that's probably a fairly bad idea.

Other options are to use a prepared statement or a stored procedure with the IP as a parameter, which force the planner
touse a more general plan because it doesn't know which values you're going to search for before it plans the query. 

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4c8675d010409863511634!



Re: Forcing the right queryplan

От
Scott Marlowe
Дата:
On Tue, Sep 7, 2010 at 8:48 AM, Henk van Lingen <H.G.K.vanLingen@uu.nl> wrote:
>
> Thanks for your answer. Dropping the (pkey) index is not an option.
> iostat suggest the thing is CPU bound (%iowait remaining 11% but cpu
> rizing from 1 to 13 %)

How man cores that server have?  If you've got 8 cores and one IO
bound on this query it'll hit 12%.. IOBound.  rely on iostat -xd
instead.

Re: Forcing the right queryplan

От
Henk van Lingen
Дата:
On Tue, Sep 07, 2010 at 07:26:25PM +0200, Alban Hertroys wrote:
  >
  > Do you have output of explain analyse for these queries as well? It's
  > hard to see what is actually going on with just the explain - we can't
  > see which part of the query is more expensive than the planner
  > expected, for starters.

Hi Alban,

Here are the explain analyse versions:

syslog=# explain analyze SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message
FROMsystemevents WHERE (  (to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT
100;
                                                                               Q
UERY PLAN

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------
 Limit  (cost=0.00..1205.09 rows=100 width=158) (actual time=16740.139..2360334.
006 rows=100 loops=1)
   ->  Index Scan Backward using systemevents_pkey on systemevents  (cost=0.00..
2888974.17 rows=239730 width=158) (actual time=16740.137..2360333.916 rows=100 l
oops=1)
         Filter: (to_tsvector('english'::regconfig, message) @@ to_tsquery('131.
211.112.9'::text))
 Total runtime: 2360334.078 ms
(4 rows)

syslog=# explain analyze SELECT id, devicereportedtime, facility, priority, fromhost, syslogtag, infounitid, message
FROMsystemevents WHERE (  ( to_tsvector('english', message) @@ to_tsquery ( '131.211.112.9')) )  ORDER BY id DESC LIMIT
500000;
                                                                  QUERY PLAN

--------------------------------------------------------------------------------
---------------------------------------------------------------
 Limit  (cost=727944.30..728543.82 rows=239805 width=158) (actual time=1805.251.
.1805.388 rows=464 loops=1)
   ->  Sort  (cost=727944.30..728543.82 rows=239805 width=158) (actual time=1805
.249..1805.300 rows=464 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 148kB
         ->  Bitmap Heap Scan on systemevents  (cost=61221.23..668806.93 rows=23
9805 width=158) (actual time=9.131..1786.406 rows=464 loops=1)
               Recheck Cond: (to_tsvector('english'::regconfig, message) @@ to_t
squery('131.211.112.9'::text))
               ->  Bitmap Index Scan on msgs_idx  (cost=0.00..61161.28 rows=2398
05 width=0) (actual time=0.790..0.790 rows=464 loops=1)
                     Index Cond: (to_tsvector('english'::regconfig, message) @@
to_tsquery('131.211.112.9'::text))
 Total runtime: 1805.483 ms
(9 rows)

  > Odd that more records and a more complicated plan gives faster results...
  > That's why I think we'd really want to see explain analyse output.
  > I'm guessing that there are a lot of records matching your search string

As you can see, there are only 464 matches.

  > One thing I do notice is that the first plan uses the index on id
  > instead of the ts_vector one. For queries like those you could try to
  > use a combined index like this:
  >
  > CREATE INDEX msgs_idx (to_tsvector('english'::regconfig, message), id)
  > ON systemevents USING (gin);

I will look into this.

Thanks,
--
Henk van Lingen, ICT-SC Netwerk & Telefonie,                  (o-      -+
Universiteit Utrecht, Jenalaan 18a, room 0.12                 /\        |
phone: +31-30-2538453                                         v_/_      |
http://henk.vanlingen.net/             http://www.tuxtown.net/netiquette/

Re: Forcing the right queryplan

От
Tom Lane
Дата:
Henk van Lingen <H.G.K.vanLingen@uu.nl> writes:
>          ->  Bitmap Heap Scan on systemevents  (cost=61221.23..668806.93 rows=239805 width=158) (actual
time=9.131..1786.406rows=464 loops=1) 
>                Recheck Cond: (to_tsvector('english'::regconfig, message) @@ to_tsquery('131.211.112.9'::text))

Well, there's your problem: the planner is off by a factor of about 500
on its estimate of the number of rows matching this query, and that's
what's causing it to pick the wrong plan.  What you need to look into
is getting that estimate to be more in sync with reality.  Probably
increasing the stats target for the message column would help.

            regards, tom lane

Re: Forcing the right queryplan

От
Henk van Lingen
Дата:
On Thu, Sep 09, 2010 at 10:50:52AM -0400, Tom Lane wrote:
  > Henk van Lingen <H.G.K.vanLingen@uu.nl> writes:
  > >          ->  Bitmap Heap Scan on systemevents  (cost=61221.23..668806.93 rows=239805 width=158) (actual
time=9.131..1786.406rows=464 loops=1) 
  > >                Recheck Cond: (to_tsvector('english'::regconfig, message) @@ to_tsquery('131.211.112.9'::text))
  >
  > Well, there's your problem: the planner is off by a factor of about 500
  > on its estimate of the number of rows matching this query, and that's
  > what's causing it to pick the wrong plan.  What you need to look into
  > is getting that estimate to be more in sync with reality.  Probably
  > increasing the stats target for the message column would help.

  But how can I get sane estimates for syslog data? Some searchstrings will
  result in only a few hits, others in thousands of records or more.

  Regards,


--
Henk van Lingen, ICT-SC Netwerk & Telefonie,                  (o-      -+
Universiteit Utrecht, Jenalaan 18a, room 0.12                 /\        |
phone: +31-30-2538453                                         v_/_      |
http://henk.vanlingen.net/             http://www.tuxtown.net/netiquette/

Re: Forcing the right queryplan

От
Tom Lane
Дата:
Henk van Lingen <H.G.K.vanLingen@uu.nl> writes:
> On Thu, Sep 09, 2010 at 10:50:52AM -0400, Tom Lane wrote:
>>>> Well, there's your problem: the planner is off by a factor of about 500
>>>> on its estimate of the number of rows matching this query, and that's
>>>> what's causing it to pick the wrong plan.  What you need to look into
>>>> is getting that estimate to be more in sync with reality.  Probably
>>>> increasing the stats target for the message column would help.

>   But how can I get sane estimates for syslog data? Some searchstrings will
>   result in only a few hits, others in thousands of records or more.

That's what ANALYZE is for ...

            regards, tom lane

Re: Forcing the right queryplan

От
Henk van Lingen
Дата:
On Thu, Sep 09, 2010 at 11:16:36AM -0400, Tom Lane wrote:
  > Henk van Lingen <H.G.K.vanLingen@uu.nl> writes:
  > > On Thu, Sep 09, 2010 at 10:50:52AM -0400, Tom Lane wrote:
  > >>>> Well, there's your problem: the planner is off by a factor of about 500
  > >>>> on its estimate of the number of rows matching this query, and that's
  > >>>> what's causing it to pick the wrong plan.  What you need to look into
  > >>>> is getting that estimate to be more in sync with reality.  Probably
  > >>>> increasing the stats target for the message column would help.
  >
  > >   But how can I get sane estimates for syslog data? Some searchstrings will
  > >   result in only a few hits, others in thousands of records or more.
  >
  > That's what ANALYZE is for ...

  Yes, off course. But I don't see how the most_common_vals & freqs and the
  histogram_bounds for a text field with syslog data make any sense when
  doing doing a search for a substring. Increasing the number of entries in
  those stats lists doesn't make any sense also, i presume.

  Those stats should be based on analysis of the to_tsvector index, to have
  any meaning, i think.

  Today I will look into the multicolumn index suggestion.

  Regards,

--
Henk van Lingen, ICT-SC Netwerk & Telefonie,                  (o-      -+
Universiteit Utrecht, Jenalaan 18a, room 0.12                 /\        |
phone: +31-30-2538453                                         v_/_      |
http://henk.vanlingen.net/             http://www.tuxtown.net/netiquette/

Re: Forcing the right queryplan

От
Arjen Nienhuis
Дата:
Maybe you can extract stuff like IP addresses and words like 'error'
and put it in a separate column in the table. Full text search is not
a solution for data that is in a wrong format.

On Fri, Sep 10, 2010 at 10:27 AM, Henk van Lingen <H.G.K.vanLingen@uu.nl> wrote:
> On Thu, Sep 09, 2010 at 11:16:36AM -0400, Tom Lane wrote:
>  > Henk van Lingen <H.G.K.vanLingen@uu.nl> writes:
>  > > On Thu, Sep 09, 2010 at 10:50:52AM -0400, Tom Lane wrote:
>  > >>>> Well, there's your problem: the planner is off by a factor of about 500
>  > >>>> on its estimate of the number of rows matching this query, and that's
>  > >>>> what's causing it to pick the wrong plan.  What you need to look into
>  > >>>> is getting that estimate to be more in sync with reality.  Probably
>  > >>>> increasing the stats target for the message column would help.
>  >
>  > >   But how can I get sane estimates for syslog data? Some searchstrings will
>  > >   result in only a few hits, others in thousands of records or more.
>  >
>  > That's what ANALYZE is for ...
>
>  Yes, off course. But I don't see how the most_common_vals & freqs and the
>  histogram_bounds for a text field with syslog data make any sense when
>  doing doing a search for a substring. Increasing the number of entries in
>  those stats lists doesn't make any sense also, i presume.
>
>  Those stats should be based on analysis of the to_tsvector index, to have
>  any meaning, i think.
>
>  Today I will look into the multicolumn index suggestion.
>
>  Regards,
>
> --
> Henk van Lingen, ICT-SC Netwerk & Telefonie,                  (o-      -+
> Universiteit Utrecht, Jenalaan 18a, room 0.12                 /\        |
> phone: +31-30-2538453                                         v_/_      |
> http://henk.vanlingen.net/             http://www.tuxtown.net/netiquette/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>