Re: NOT LIKE much faster than LIKE?

От: Simon Riggs
Тема: Re: NOT LIKE much faster than LIKE?
Дата: ,
Msg-id: 1137026916.21025.585.camel@localhost.localdomain
(см: обсуждение, исходный текст)
Ответ на: Re: NOT LIKE much faster than LIKE?  (Tom Lane)
Список: pgsql-performance

Скрыть дерево обсуждения

NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
 Re: NOT LIKE much faster than LIKE?  (Tom Lane, )
  Re: NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
   Re: NOT LIKE much faster than LIKE?  (Christopher Kings-Lynne, )
    Re: NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
   Re: NOT LIKE much faster than LIKE?  (Tom Lane, )
    Re: NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
     Re: NOT LIKE much faster than LIKE?  (Greg Stark, )
    Re: NOT LIKE much faster than LIKE?  (Matteo Beccati, )
     Re: NOT LIKE much faster than LIKE?  (Tom Lane, )
      Re: NOT LIKE much faster than LIKE?  (Simon Riggs, )
       Re: NOT LIKE much faster than LIKE?  (Tom Lane, )
        Re: NOT LIKE much faster than LIKE?  (Simon Riggs, )
         Re: NOT LIKE much faster than LIKE?  (Tom Lane, )
          Re: NOT LIKE much faster than LIKE?  (Simon Riggs, )
           Re: NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
        Re: NOT LIKE much faster than LIKE?  (Simon Riggs, )
   Re: NOT LIKE much faster than LIKE?  (Stephan Szabo, )
 Re: NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
  Re: NOT LIKE much faster than LIKE?  (Tom Lane, )
   Re: NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
 Re: NOT LIKE much faster than LIKE?  ("Jim C. Nasby", )
  Re: NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
   Re: NOT LIKE much faster than LIKE?  (Andrea Arcangeli, )
   Re: NOT LIKE much faster than LIKE?  ("Jim C. Nasby", )
    Re: NOT LIKE much faster than LIKE?  (Tom Lane, )

On Tue, 2006-01-10 at 17:21 -0500, Tom Lane wrote:
> Simon Riggs <> writes:
> > I think its OK to use the MCV, but I have a problem with the current
> > heuristics: they only work for randomly generated strings, since the
> > selectivity goes down geometrically with length.
>
> We could certainly use a less aggressive curve for that.  You got a
> specific proposal?

non-left anchored LIKE is most likely going to be used with
unstructured, variable length data - else we might use SUBSTRING
instead. My proposal would be to assume that LIKE is acting on human
language text data.

I considered this a while back, but wrote it off in favour of dynamic
sampling - but it's worth discussing this to see whether we can improve
on things without that.

Here's one of the links I reviewed previously:
http://www.ling.lu.se/persons/Joost/Texts/studling.pdf
Sigurd et al [2004]

This shows word frequency distribution peaks at 3 letter/2 phoneme
words, then tails off exponentially after that.

Clearly when search string > 3 then the selectivity must tail off
exponentially also, since we couldn't find words shorter than the search
string itself. The search string might be a phrase, but it seems
reasonable to assume that phrases also drop off in frequency according
to length. It is difficult to decide what to do at len=2 or len=3, and I
would be open to various thoughts, but would default to keeping
like_selectivity as it is now.

Sigurd et al show that word length tails off at 0.7^Len beyond Len=3, so
selectivity FIXED_CHAR_SEL should not be more than 0.7, but I see no
evidence for it being as low as 0.2 (from the published results).  For
simplicity, where Len > 3, I would make the tail off occur with factor
0.5, rather than 0.2.

We could see a few more changes from those results, but curbing the
aggressive tail off would be a simple and easy act.

Best Regards, Simon Riggs



В списке pgsql-performance по дате сообщения:

От: Tom Lane
Дата:
Сообщение: Re: indexes on primary and foreign keys
От: Simon Riggs
Дата:
Сообщение: Re: Extremely irregular query performance