Re: [GENERAL] Problem (bug?) with like

From: Tom Lane
Subject: Re: [GENERAL] Problem (bug?) with like
Date:
Msg-id: 20011204171627.GB450@lorien.net
Responses: Re: [GENERAL] Problem (bug?) with like  (Bruce Momjian <pgman@candle.pha.pa.us>)
           Re: [GENERAL] Problem (bug?) with like  (Bruce Momjian <pgman@candle.pha.pa.us>)
List: pgsql-hackers
bombadil@wanadoo.es writes:
> 1) # explain SELECT * from v_cliente where nombre like '%DA%';

> Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
>   ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
>         ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
>   ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
>         ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>               ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
>                   ...

> 2) explain SELECT * from v_cliente where nombre like '%DAVID%';

> Nested Loop  (cost=891.91..61414.58 rows=7638 width=183)
>   ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=1 width=131)
>   ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>         ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
>             ... [same subplan as above]

The problem here is that the planner is being way too optimistic about
the selectivity of LIKE '%DAVID%' --- notice the estimate that only
one matching row will be found in cliente, rather than 54 as with '%DA%'.
So it chooses a plan that avoids the sort overhead needed for an
efficient merge join with the other tables.  That would be a win if
there were only one matching row, but as soon as there are lots, it's
a big loss, because the subquery to join the other tables gets redone
for every matching row :-(
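To make the tradeoff concrete, here is a rough back-of-envelope sketch (in Python, using the cost numbers from the EXPLAIN output above; the arithmetic is a simplification of what the planner actually does, ignoring per-tuple join overhead) of why the nested loop looks cheap at one estimated row but blows up at the real row count:

```python
# Costs taken from the EXPLAIN output quoted above (planner cost units).
SUBQUERY_COST = 37088.66    # inner Subquery Scan d, re-run per outer row
OUTER_SCAN_COST = 16236.88  # Seq Scan on cliente c
MERGE_JOIN_COST = 62874.36  # total cost of the merge-join plan, sorts included

def nested_loop_cost(outer_rows):
    # Simplified model: the inner subplan is repeated once for every
    # matching outer row (this omits join overhead, so the real
    # planner estimate of 61414.58 for rows=1 is somewhat higher).
    return OUTER_SCAN_COST + outer_rows * SUBQUERY_COST

# With the planner's estimate of 1 matching row, the nested loop
# looks cheaper than the merge join:
assert nested_loop_cost(1) < MERGE_JOIN_COST

# With the 257 rows that actually match '%DAVID%', the nested loop
# is roughly 150x more expensive than the merge-join plan:
assert nested_loop_cost(257) > 100 * MERGE_JOIN_COST
```

This matches the observed timings: the merge-join plan chosen for '%DA%' ran in seconds, while the nested loop chosen for '%DAVID%' took six minutes.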

>> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

>  1)   2672 rows    -> 3.59 sec.
>  2)   257 rows     -> 364.69 sec.

I am thinking that the rules for selectivity of LIKE patterns probably
need to be modified.  Presently the code assumes that a long constant
string has probability of occurrence proportional to the product of the
probabilities of the individual letters.  That might be true in a random
world, but people don't search for random strings.  I think we need to
back off the selectivity estimate by some large factor to account for
the fact that the pattern being searched for is probably not random.
Anyone have ideas how to do that?

            regards, tom lane

