Re: Problem search on text arrays, using the overlaps (&&) operator

Поиск
Список
Период
Сортировка
От nha
Тема Re: Problem search on text arrays, using the overlaps (&&) operator
Дата
Msg-id 4A547F6E.4030705@free.fr
обсуждение исходный текст
Ответ на Re: Problem search on text arrays, using the overlaps (&&) operator  (John Cheng <jlcheng@ymail.com>)
Список pgsql-general
Hello,

Le 8/07/09 0:52, John Cheng a écrit :
> I don't mean to be pesky. I was just wondering if there is anything
> else I should try?
>
> Should I simply rewrite all queries, change the form
>
> WHERE textarr && '{foo, bar}'::text[]
>
> To
>
> WHERE (textarr && '{foo}'::text[]
>        OR textarr && '{bar}'::text[])
>
> ?
>

While still puzzled by the big runtime difference you report between the
2 condition forms, I went on assessing these runtimes on my side from
the new case statements that are assumed to reflect more the real world.

Here are some measure results I got: (sorry for this long table)

seq    style    runtime
---    -----    -------
(db=slf)
N01    OR-EA    6 237
N02    CC-EA    5 250
N03    OR+EA    12 628
N04    CC+EA    12 700
N05    OR+EA    15 679
N06    CC+EA    11 510
N07    CC-EA    7 712
N08    OR-EA    8 741
N09    CC-EA    4 963
N10    OR-EA    6 499
(db=stg)
N11    CC+EA    15 978
N12    OR+EA    15 350
N13    CC-EA    8 102
N14    OR-EA    9 428
N15    OR-EA    5 267
N16    CC-EA    5 017
N17    OR-EA    6 119
N18    CC-EA    4 955
N19    OR+EA    11 722
N20    CC+EA    11 532
N21    OR-EA    7 303
N22    CC-EA    5 438
N23    CC-EA    5 519
N24    OR-EA    5 373
N25    OR-EA    5 422
N26    CC-EA    5 064
(db=stg)
N27    CC-EA    8 314
(db=slf)
N28    OR-EA    6 656
(db=stg)
N29    OR-EA    6 760
(db=slf)
N30    CC-EA    6 753
(db=stg)
N31    CC-EA    5 500
(db=slf)
N32    OR-EA    5 907
(db=stg)
N33    OR-EA    5 391
(db=slf)
N34    CC-EA    5 517
---    -----    ----------

Legend
------
seq: sequence order.
style: condition style of query.
    CC: style "arr&&{f,b}" (one clause with multi-value text table).
    OR: style "arr&&{f} or arr&&{b}" (many clauses with 1-value text table).
    OR2: same style as style OR, with explicit JOIN in query expression.
    +EA: performed with EXPLAIN ANALYZE on query. Slower.
    -EA: performed without EXPLAIN ANALYZE on query. Faster.
runtime: run time in milliseconds.
(db=?): indicates that the following sequences have been performed after
a drop-and-create process for all the tables and indexes.
------

Results from 2 selected EXPLAIN ANALYZE sequences:

-- seq 03 (OR+EA)
Aggregate  (cost=37630.52..37630.53 rows=1 width=0) (actual
time=12628.182..12628.184 rows=1 loops=1)
  ->  Hash Join  (cost=25989.12..37601.04 rows=11792 width=0) (actual
time=8796.002..12231.422 rows=300000 loops=1)
        Hash Cond: ((bar.id)::numeric = foo.bar_id)
        ->  Seq Scan on bar  (cost=0.00..4328.00 rows=300000 width=4)
(actual time=0.025..402.458 rows=300000 loops=1)
        ->  Hash  (cost=24636.81..24636.81 rows=82425 width=8) (actual
time=8795.027..8795.027 rows=2097152 loops=1)
              ->  Bitmap Heap Scan on foo  (cost=1565.44..24636.81
rows=82425 width=8) (actual time=670.248..5098.109 rows=2097152 loops=1)
                    Recheck Cond: ((keywords && '{ford}'::text[]) OR
(keywords && '{toyota}'::text[]) OR (keywords && '{volkswagen}'::text[])
OR (keywords && '{saturn}'::text[]) OR (keywords && '{honda}'::text[])
OR (keywords && '{porsche}'::text[]) OR (keywords && '{hummer}'::text[])
OR (keywords && '{ferrari}'::text[]))
                    ->  BitmapOr  (cost=1565.44..1565.44 rows=83879
width=0) (actual time=665.516..665.516 rows=0 loops=1)
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=114.013..114.013
rows=262144 loops=1)
                                Index Cond: (keywords && '{ford}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=72.398..72.398
rows=262144 loops=1)
                                Index Cond: (keywords && '{toyota}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=74.118..74.118
rows=262144 loops=1)
                                Index Cond: (keywords &&
'{volkswagen}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.486..58.486
rows=262144 loops=1)
                                Index Cond: (keywords && '{saturn}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=114.671..114.671
rows=524288 loops=1)
                                Index Cond: (keywords && '{honda}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=115.290..115.290
rows=524288 loops=1)
                                Index Cond: (keywords &&
'{porsche}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.184..58.184
rows=262144 loops=1)
                                Index Cond: (keywords && '{hummer}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.336..58.336
rows=262144 loops=1)
                                Index Cond: (keywords &&
'{ferrari}'::text[])
Total runtime: 12628.401 ms
-- seq 03 (OR+EA)

-- seq 04 (CC+EA)
Aggregate  (cost=26726.37..26726.38 rows=1 width=0) (actual
time=12700.620..12700.621 rows=1 loops=1)
  ->  Hash Join  (cost=17879.62..26722.62 rows=1500 width=0) (actual
time=8482.572..12272.449 rows=300000 loops=1)
        Hash Cond: ((bar.id)::numeric = foo.bar_id)
        ->  Seq Scan on bar  (cost=0.00..4328.00 rows=300000 width=4)
(actual time=0.029..412.984 rows=300000 loops=1)
        ->  Hash  (cost=17748.56..17748.56 rows=10485 width=8) (actual
time=8481.524..8481.524 rows=2097152 loops=1)
              ->  Bitmap Heap Scan on foo  (cost=177.69..17748.56
rows=10485 width=8) (actual time=978.464..4679.954 rows=2097152 loops=1)
                    Recheck Cond: (keywords &&
'{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
                    ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07
rows=10485 width=0) (actual time=973.569..973.569 rows=2097152 loops=1)
                          Index Cond: (keywords &&
'{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
Total runtime: 12700.838 ms
-- seq 04 (CC+EA)

As far as I tried, the runtime difference between the 2 forms varies up
to 10% and always under 1 second--ie. far away from the 100% and (15-8=)
7 seconds reported. Moreover the form "arr&&{f,b}" (let us call it: form
1) often shows better than the other (let us call it: form 2)--already
noticed this fact by previous measures without joins.

Exception occurs in a special case: for the first query submitted
immediately after populating tables. It seems as if indexes were not
optimized at this time. But all the following queries, whatever form is
considered, run in lesser and similar time.

When observing the runtime of intermediate steps (bitmap heap scan on
table foo [step 1], sequential scan on table bar [step 2], hash join
[step 3]), it can be noted that all 3 steps run in a comparable time
whatever form used:
- no relevant difference for steps 2 and 3, but an attended long time
for retrieving all the resulting rows because of their high number;
- for step 1: form 1 (arr&&{f,b}) needs more time to extract the 1st row
of the resultset but is faster to collect the last rows. So no
significant difference may be also revealed here.

As I mentioned, I noticed that the runtime is longer for the 1st query
performed after populating tables and that, since the 2nd query
performed, runtimes clearly decrease towards a mean (about 6 seconds on
my side). Maybe the runtimes reported (15 and 8 seconds respectively)
occur in such a case (just after populating tables) with a performing
order starting with form 1. As much as I saw, the sequence order acts
upon the runtime reduction. Moreover successive queries tend to mean and
equal runtimes with the 2 forms.

About the user complaint, I would be not so astonished because of the
duration of one query (more than 5 seconds in the 2 cases). There are so
many rows to select that it seems not really possible to get a lesser
runtime.

What I may conclude: query plans show difference in performing the
queries: each Where clause is assessed before a commun heap scan of
table foo. The Where clause of the form 1 (multi-value text table) is
longer to run because of multi-value to test for each key; the Where
clause of the form 2 (1-value text table) is faster but the addition of
the corresponding 1-value text table Where clauses runs in a comparable
time. This observation resulted from measures on my sides and may not
correctly reflect some proper configurations of the real production
environment.

Opting to form 2 (arr&&{f} or arr&&{b}) will surely not make runtime
worst. So it would not be a bad choice to implement it if convinced.

Another way could concern the hash join. It has been shown that this
step costs a lot with respect to the overall runtime. Depending on
available storage space and DBMS load, a kind of materialized view may
be handled in order to cut off the overloading join. Here are some
suggested statements to create this helper table:

-- Creates a table jfoobar wrt. tables foo and bar
-- (max 10 seconds elapsed on my platform)
CREATE OR REPLACE TABLE jfoobar AS
SELECT
    foo.bar_id, foo.keywords
FROM foo
INNER JOIN bar ON foo.bar_id = bar.id;
-- Creates an index for jfoobar
-- (max 3 seconds elapsed on my platform)
--  this way...
CREATE INDEX jfoobar_idx ON jfoobar(bar_id, keywords);
--  or this one...
DROP INDEX jfoobar_idx;
CREATE INDEX jfoobar_idx ON jfoobar USING gin(keywords);
--

This table may be updated or replaced on a regular time base or by
triggered event on updating foo or bar tables.

Finally, any query of the form 1 or 2 will run in less than 300 ms:
-- form 1: unique text table
SELECT    count(*) FROM jfoobar
WHERE
    keywords && '{ford, toyota, volkswagen, saturn, honda, porsche,
hummer, ferrari}'::text[];

-- form 2: developped text table
SELECT    count(*) FROM jfoobar
WHERE
    (
        keywords && '{ford}'::text[]
        OR keywords && '{toyota}'::text[]
        OR keywords && '{volkswagen}'::text[]
        OR keywords && '{saturn}'::text[]
        OR keywords && '{honda}'::text[]
        OR keywords && '{porsche}'::text[]
        OR keywords && '{hummer}'::text[]
        OR keywords && '{ferrari}'::text[]
    );
--

The runtime would only depend on the update strategy for the join table.

Regards.

--
nha / Lyon / France.

В списке pgsql-general по дате отправления:

Предыдущее
От: Andreas Wenk
Дата:
Сообщение: Re: Password?
Следующее
От: Ms swati chande
Дата:
Сообщение: Re: Password?