Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

Поиск

Список

Период

Сортировка

От	Jeff Davis
Тема	Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Дата	12 июля 2014 г. 02:16:07
Msg-id	1405131377.9081.244.camel@jeff-desktop обсуждение исходный текст
Ответ на	Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Список	pgsql-hackers

Дерево обсуждения

On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Attached is a small patch to $SUBJECT.
> > In master, only single-byte characters are allowed as an escape. Of
> > course, with the patch it must still be a single character, but it may
> > be multi-byte.
>
> I'm concerned about the performance cost of this patch.  Have you done
> any measurements about what kind of overhead you are putting on the
> inner loop of similar_escape?

I didn't consider this very performance critical, because this is
looping through the pattern, which I wouldn't expect to be a long
string. On my machine using en_US.UTF-8, the difference is imperceptible
for a SIMILAR TO ... ESCAPE query.

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

    explain analyze
      select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;

which was the worst reasonable case I could think of. (It appears that
selecting from a table is faster than from generate_series. I'm curious
what you use when testing the performance of an individual function at
the SQL level.)

> At the very least, please don't call GetDatabaseEncoding() over again
> every single time through the inner loop.  More generally, why do we
> need to use pg_encoding_verifymb?  The input data is presumably validly
> encoded.  ISTM the significantly cheaper pg_mblen() would be more
> appropriate.

Thank you. Using the non-verifying variants reduces the penalty in the
above test to 1%.

If needed, we could optimize further through code specialization, as
like_escape() does. Though I think like_escape() is specialized just
because MatchText() is specialized; and MatchText is specialized because
it operates on the actual data, not just the pattern. So I don't see a
reason to specialize similar_escape().

Regards,
    Jeff Davis

Вложения

similar-escape.patch

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

Вложения