Unaccent performance

Поиск
Список
Период
Сортировка
От Thom Brown
Тема Unaccent performance
Дата
Msg-id CAA-aLv41dV78AxFWZgXdYdi9-zJVbhVyx6DijGfqeJ-87UGjdw@mail.gmail.com
обсуждение исходный текст
Ответы Re: Unaccent performance  (Thom Brown <thom@linux.com>)
Список pgsql-hackers
Hi,

The unaccent extension is great, especially with its customisability, but it's not always easy to recommend.  I witnessed a customer using no less than 56 nested replace functions in an SQL function.  I looked to see how much this can be mitigated by unaccent.  It turns out that not all the characters they were replacing can be replaced by unaccent, either because they replace more than 1 character at a time, or the character they're replacing, for some reason, isn't processed by unaccent, even with a custom rules file.

So there were 20 characters I could identify that they were replacing.  I made a custom rules file and compared its performance to the difficult-to-manage set of nested replace calls.

CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
 RETURNS text
 LANGUAGE sql
 IMMUTABLE
AS $function$
SELECT replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
;
$function$

postgres=# SELECT myunaccent(sometext::text) FROM (SELECT 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET 999999 LIMIT 1;
      myunaccent      
----------------------
 AAAAAAAaaaaaaaAaAaAa
(1 row)

Time: 726.282 ms
postgres=# SELECT unaccent(sometext::text) FROM (SELECT 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET 999999 LIMIT 1;
       unaccent       
----------------------
 AAAAAAAaaaaaaaAaAaAa
(1 row)

Time: 3305.252 ms

The timings are actually pretty much the same even if I introduce 187 nested replace calls for every line in the unaccent.rules file for 187 characters.  But the same character set with unaccent increases to 7418.526 ms with the same type of query as above.  That's 10 times more expensive.

Is there a way to boost the performance to make its adoption more palatable?

--
Thom

В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Fetter
Дата:
Сообщение: Review [was Re: MD5 aggregate]
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls