Discussion: "like" vs "substring" again


"like" vs "substring" again

From:
Christian Schröder
Date:
Hi list,

last week I asked a question about a query with several joins and a
"like" operator which was really slow. When I replaced "like" with
"substring" (which was possible because the comparison was simply "bla
like '123%'"), the query became much faster because the query
optimizer chose a different plan.

Gregory Stark gave me the hint (thanks, Gregory!) that the query
optimizer made wrong assumptions about the selectivity of the "like".
When I used "substring" the assumptions became better and so it chose a
better (faster) plan. I then increased the statistics target of the
column and the query with "like" became as fast as when I used
"substring". So far, so good.

Now I have a similar problem: I have a query (which doesn't look too
complicated to me) which takes about 4.5 hours on a 2 GHz dual-core Xeon
machine. The query joins several tables and has two comparisons, one
with "not like", the other with "substring(...) = ". When I use "like"
and "not like" together, or "substring(...) = " and "substring(...) <> "
together, the query takes about 5 seconds. The plan is identical in both
cases and different from the plan I get when I mix the comparisons. The
most obvious difference is the number of rows the query optimizer expects
to get from the filtered table: it expects 1 row if I mix the
comparison operators and 84 if I consistently use "like" or "substring".
The real number of selected rows is 1667 (out of a total of 2884 rows),
so both estimates are far off. Note that this is exactly the same
column for which I increased the statistics target to 500 after last
week's discussion ...

I then set up a test table with the problematic column and filled it
with the same data. The test table looks as follows:

       Table "pg_temp_3.temp"
 Column |     Type      | Modifiers
--------+---------------+-----------
 test   | character(10) | not null

I set the statistics target to 1000 for this column and ran the
following queries:

explain analyze select * from temp where test like '11%' and test not like '113%';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on "temp"  (cost=0.00..62.26 rows=39 width=14) (actual time=0.012..1.229 rows=1678 loops=1)
   Filter: ((test ~~ '11%'::text) AND (test !~~ '113%'::text))
 Total runtime: 1.655 ms
(3 rows)

explain analyze select * from temp where substring(test from 1 for 2) = '11' and substring(test from 1 for 3) <> '113';
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on "temp"  (cost=0.00..91.10 rows=14 width=14) (actual time=0.020..3.282 rows=1678 loops=1)
   Filter: (("substring"((test)::text, 1, 2) = '11'::text) AND ("substring"((test)::text, 1, 3) <> '113'::text))
 Total runtime: 3.719 ms
(3 rows)

explain analyze select * from temp where substring(test from 1 for 2) = '11' and test not like '113%';
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on "temp"  (cost=0.00..76.68 rows=1 width=14) (actual time=0.018..2.469 rows=1678 loops=1)
   Filter: (("substring"((test)::text, 1, 2) = '11'::text) AND (test !~~ '113%'::text))
 Total runtime: 2.914 ms
(3 rows)

As far as I understand, all three queries are semantically identical and
return the same result set. However, the query optimizer makes very
different estimates of the number of rows each query will return. All of
the estimates are far from reality, and at least the last one leads to
disastrous plans when this "where" clause is part of a more complex query.
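For what it's worth, the claimed equivalence is easy to check outside the database. Below is a small Python sketch (plain strings, ignoring the char(10) space padding; helper names and sample values are invented, not taken from the queries above) that emulates the three WHERE clauses:

```python
# Emulating the three WHERE clauses on plain Python strings
# (char(10) space padding ignored); sample values are invented.

def like_prefix(value, prefix):
    """Emulates: value LIKE 'prefix%'."""
    return value.startswith(prefix)

def substr_eq(value, length, literal):
    """Emulates: substring(value from 1 for length) = 'literal'."""
    return value[:length] == literal

rows = ["11", "110", "113", "1130", "114", "12", "2134567890"]

q1 = [r for r in rows if like_prefix(r, "11") and not like_prefix(r, "113")]
q2 = [r for r in rows if substr_eq(r, 2, "11") and not substr_eq(r, 3, "113")]
q3 = [r for r in rows if substr_eq(r, 2, "11") and not like_prefix(r, "113")]

# All three filters select exactly the same rows:
assert q1 == q2 == q3 == ["11", "110", "114"]
```

So the differing row estimates really do come from how the planner costs each spelling, not from any semantic difference.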

So I have the following questions:

   1. Why does the query optimizer not recognize that the expressions
      are equivalent?
   2. What can I do to improve the query optimizer's estimates? I
      tried creating an index (with opclass "bpchar_pattern_ops"), which
      was actually used in the first query but did not improve the
      estimate or the execution speed.

Thanks again for any help!

Regards,
    Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer



Performance Issues (was: "like" vs "substring" again)

From:
Christian Schröder
Date:
Hi list,
I am still fighting with these really slow database queries (see
http://www.nabble.com/%22like%22-vs-%22substring%22-again-t4447906.html),
and I still believe that the cause of the problem is that the query
planner makes incorrect estimates of the selectivity of the "where"
clauses.
I wonder whether it is possible to make the query planner perform a
sequential scan over a table *before* it starts planning. If I know that
a table has only about 3000 rows, the overhead of this sequential
scan can be ignored. On the other hand, it would give the planner an
exact basis of data for its planning.
Or would it be possible to tweak how the planner determines the
selectivity? I have read in the docs (chapter 54.1) that in the case of
more than one condition in the where clause, independence is assumed. In
my case ("... where test like '11%' and test not like '113%'") this is
clearly not the case, so it might be an interesting point to address.
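To make the independence assumption concrete, here is a numeric sketch (all row counts invented, only loosely in the spirit of the queries above) of how multiplying per-condition selectivities can go wrong when one pattern is contained in the other:

```python
# Invented row counts illustrating the independence assumption:
total = 1000
n_like_11 = 600    # rows matching: test LIKE '11%'
n_like_113 = 450   # rows matching: test LIKE '113%' (all also match '11%')

sel_a = n_like_11 / total             # selectivity of  test LIKE '11%'
sel_b = (total - n_like_113) / total  # selectivity of  test NOT LIKE '113%'

# What the planner does under the independence assumption:
est_rows = round(total * sel_a * sel_b)   # 1000 * 0.6 * 0.55 = 330

# True count: every '113%' row is also an '11%' row, so the two
# conditions are strongly correlated:
true_rows = n_like_11 - n_like_113        # 150

assert est_rows == 330 and true_rows == 150
```

Even with perfect per-condition statistics, the product (330) is more than twice the true row count (150), because the conditions are anything but independent.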
Do you have any other tips for me?

Kind regards,
    Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer



Re: Performance Issues (was: "like" vs "substring" again)

From:
"John D. Burger"
Date:
Christian Schröder wrote:

> Or would it be possible to tweak how the planner determines the
> selectivity? I have read in the docs (chapter 54.1) that in case of
> more than one condition in the where clause, independency is
> assumed. In my case ("... where test like '11%' and test not like
> '113%'") this is clearly not the case, so it might be an
> interesting point to address.

I think the planner does think about the interactions of
inequalities, so if you can express your query with less-than and
friends, or even with BETWEEN, you might get a better plan.  I don't
know the details of your setup, but you can do things like this with
any ordered type:

    where test between '11' and '113'
    or test >= '114'

I know this does not match the exact semantics of your query, but
hopefully you get the idea.
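For what it's worth, with digit-only values (where '4' immediately follows '3' in the collation, and leaving aside the char(10) padding) the exact semantics of the LIKE pair can be recovered as two lexicographic ranges. A quick Python sanity check of that equivalence, with invented sample values:

```python
# For digit-only strings (no char(10) padding), the LIKE pair
#   test LIKE '11%' AND test NOT LIKE '113%'
# equals two lexicographic ranges.  Sample values are invented.

def like_pair(s):
    return s.startswith("11") and not s.startswith("113")

def range_pair(s):
    return ("11" <= s < "113") or ("114" <= s < "12")

samples = ["1", "11", "110", "112999", "113", "1139", "114", "119", "12", "2"]
assert all(like_pair(s) == range_pair(s) for s in samples)
```

With the range form, the planner can use its histogram-based inequality estimator instead of guessing at pattern selectivity.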

- John D. Burger
   MITRE

Re: Performance Issues

From:
Christian Schröder
Date:
John D. Burger wrote:
> Christian Schröder wrote:
>
>> Or would it be possible to tweak how the planner determines the
>> selectivity? I have read in the docs (chapter 54.1) that in case of
>> more than one condition in the where clause, independency is assumed.
>> In my case ("... where test like '11%' and test not like '113%'")
>> this is clearly not the case, so it might be an interesting point to
>> address.
>
> I think the planner does think about the interactions of inequalities,
> so if you can express your query with less-than and friends, or even
> with BETWEEN, you might get a better plan.  I don't know the details
> of your setup, but you can do things like this with any ordered type:
>
>    where test between '11' and '113'
>     or test >= '114'
>
> I know this does not match the exact semantics of your query, but
> hopefully you get the idea.

There are two drawbacks to this solution:

   1. It is not always possible to rewrite the "like" or "substring"
      queries with standard relational operators.
   2. It is annoying for my users that they have to tweak the query
      until they find a variant that takes 5 seconds to finish instead
      of 4 hours.

I think it is my job as db admin to make the database work the way my
users need it, and not my users' job to find a solution that fits the
database's needs ...

Is there really nothing that I can do?

Regards,
    Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer



Re: Performance Issues

From:
Alvaro Herrera
Date:
Christian Schröder wrote:

> I think it is my job as db admin to make the database work the way my users
> need it, and not the user's job to find a solution that fits the database's
> needs ...
>
> Is there really nothing that I can do?

You can improve the selectivity estimator function.  One idea is that if
you are storing something that's not really a general character string,
develop a specific datatype, with a more precise selectivity estimator.
If you are up to coding in C, that is.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Performance Issues

From:
Christian Schröder
Date:
Alvaro Herrera wrote:
> Christian Schröder wrote:
>
>
>> I think it is my job as db admin to make the database work the way my users
>> need it, and not the user's job to find a solution that fits the database's
>> needs ...
>>
>> Is there really nothing that I can do?
>>
>
> You can improve the selectivity estimator function.  One idea is that if
> you are storing something that's not really a general character string,
> develop a specific datatype, with a more precise selectivity estimator.
> If you are up to coding in C, that is.
>

Hm, that sounds interesting! I will definitely give it a try.
Will that also solve the problem of combining more than one of these
conditions? As far as I can see, the main issue at the moment is that we
often have "... where test like '11%' and test not like '113%'" in our
queries. Even if the selectivity estimate for each single condition is
improved, it will still be wrong to multiply the selectivities.

I think I will have a look at the src/backend/optimizer/util/plancat.c,
src/backend/optimizer/path/clausesel.c and
src/backend/utils/adt/selfuncs.c files after my holiday.

Kind regards,
    Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer



Re: Performance Issues

From:
Alvaro Herrera
Date:
Christian Schröder wrote:
> Alvaro Herrera wrote:
>> Christian Schröder wrote:
>>
>>
>>> I think it is my job as db admin to make the database work the way my
>>> users need it, and not the user's job to find a solution that fits the
>>> database's needs ...
>>>
>>> Is there really nothing that I can do?
>>>
>>
>> You can improve the selectivity estimator function.  One idea is that if
>> you are storing something that's not really a general character string,
>> develop a specific datatype, with a more precise selectivity estimator.
>> If you are up to coding in C, that is.
>>
>
> Hm, that sounds interesting! I will definitely give it a try.
> Will that also solve the problem of combining more than one of these
> conditions? As far as I can see, the main issue at the moment is that we
> often have "... where test like '11%' and test not like '113%'" in our
> queries. Even if the selectivity estimate for each single condition is
> improved, it will still be wrong to multiply the selectivities.

Unless you can come up with an operator that better expresses the
"starts with 11 but not with 113" type of condition.  For example, if
these were telephone number prefixes or something like that, there is
probably some way to do that in a single operation instead of two, and
the selectivity function could produce a much more accurate estimate,
removing the need to multiply.
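A real implementation of this would be a C-coded operator plus a custom selectivity estimator; purely as a semantic sketch (function name and data invented), the combined test would behave like this:

```python
# Semantic sketch of a combined "prefix family" operator (Python-level
# only; a real implementation would be a C operator with its own
# selectivity estimator).  Function name and data are invented.

def matches_prefix_family(code, wanted, excluded):
    """One single test: code starts with `wanted` but not with `excluded`."""
    return code.startswith(wanted) and not code.startswith(excluded)

codes = ["110", "1125", "1134", "114", "12", "90"]
hits = [c for c in codes if matches_prefix_family(c, "11", "113")]
assert hits == ["110", "1125", "114"]
```

Because the planner would then see a single condition, its estimator could be fed the joint selectivity directly instead of multiplying two correlated factors.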

--
Alvaro Herrera                        http://www.advogato.org/person/alvherre
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)

Re: Performance Issues

From:
"Peter Childs"
Date:


On 23/09/2007, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Christian Schröder wrote:
> > Alvaro Herrera wrote:
> >> Christian Schröder wrote:
> >>
> >>
> >>> I think it is my job as db admin to make the database work the way my
> >>> users need it, and not the user's job to find a solution that fits the
> >>> database's needs ...
> >>>
> >>> Is there really nothing that I can do?
> >>>
> >>
> >> You can improve the selectivity estimator function.  One idea is that if
> >> you are storing something that's not really a general character string,
> >> develop a specific datatype, with a more precise selectivity estimator.
> >> If you are up to coding in C, that is.
> >>
> >
> > Hm, that sounds interesting! I will definitely give it a try.
> > Will that also solve the problem of combining more than one of these
> > conditions? As far as I can see, the main issue at the moment is that we
> > often have "... where test like '11%' and test not like '113%'" in our
> > queries. Even if the selectivity estimation of the single condition will be
> > improved, it will still be wrong to multiply the selectivities.
>
> Unless you can come up with an operator that expresses better the
> "starts with 11 but not with 113" type of condition.  For example if
> these were telephone number prefixes or something like that, probably
> there's some way to do that in a single operation instead of two, and
> the selectivity function could produce a much more accurate estimate
> saving the need to multiply.


select a from b where a ~ '^11[^3]'

Is that what you want?

I usually find using ~ far better than like.....

Peter Childs
 

