Re: Generalized edit function?

From	Robert Haas
Subject	Re: Generalized edit function?
Date	27 February 2011, 01:54:05
Msg-id	AANLkTi=NEHXY89xZhN1FCmpx=a7TeRgXWz313jxL8k6U@mail.gmail.com
In reply to	Re: Generalized edit function? (fork <forkandwait@gmail.com>)
List	pgsql-hackers

On Sat, Feb 26, 2011 at 7:40 PM, fork <forkandwait@gmail.com> wrote:
>> Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
>> in contrib/fuzzystrmatch still is.
>
> I am only looking at 9.0.3 for levenshtein, so I don't have any thoughts yet on
> multi-byteness so far.   I will have to figure out the multibyte character work
> once I get the basic algorithm working -- any thoughts on that?  Any pitfalls in
> porting?

The main thing with levenshtein() is that if you're working with
single byte characters then you can reference the i'th character as
x[i], whereas if you have multi-byte characters then you need to build
an offset table and look at length[i] bytes beginning at
&x[offset[i]].  That turns out to be significantly more expensive.  As
initially proposed, the patch to add multi-byte awareness built this
lookup table for any multi-byte encoding and used the faster technique
for single-byte encodings, but that wasn't actually so hot, because
the most widely used encoding these days is probably UTF-8, which of
course is a multi-byte encoding.  What we ended up with is a fast-path
for the case where both strings contain only single-byte characters,
which will always be true in a single-byte encoding but might easily
also be true in a multi-byte encoding, especially for English
speakers.  I don't know if that's exactly right for what you're trying
to do - you'll probably need to try some different things and
benchmark.  I would however recommend that you look at the
master-branch implementation of levenshtein() rather than the old
9.0.x one, because it's significantly different, and forward-porting
your changes will probably be hard.
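
For illustration only, here is a rough sketch of the two access paths
described above. It is not the contrib/fuzzystrmatch code: it assumes
valid UTF-8 input, and utf8_char_len(), count_chars(), build_offsets()
and levenshtein_sketch() are hypothetical names made up for this
example (the server has its own encoding-aware length routines).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character at *s.
 * Assumes the input is valid UTF-8. */
static int utf8_char_len(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Count characters; clear *all_single if any character is multi-byte. */
static int count_chars(const char *s, int nbytes, int *all_single)
{
    int n = 0, pos = 0;

    *all_single = 1;
    while (pos < nbytes) {
        int len = utf8_char_len((const unsigned char *) s + pos);
        if (len > 1) *all_single = 0;
        pos += len;
        n++;
    }
    return n;
}

/* Offset table: off[i] is the byte offset of character i, and
 * off[i + 1] - off[i] is its length in bytes. */
static int *build_offsets(const char *s, int nchars)
{
    int *off = malloc((nchars + 1) * sizeof(int));
    int i, pos = 0;

    assert(off != NULL);
    for (i = 0; i < nchars; i++) {
        off[i] = pos;
        pos += utf8_char_len((const unsigned char *) s + pos);
    }
    off[nchars] = pos;
    return off;
}

/* Character-level Levenshtein distance with unit costs. */
int levenshtein_sketch(const char *a, const char *b)
{
    int a_single, b_single, i, j, result;
    int m = count_chars(a, (int) strlen(a), &a_single);
    int n = count_chars(b, (int) strlen(b), &b_single);
    int fast = a_single && b_single;   /* both strings single-byte only */
    int *aoff = fast ? NULL : build_offsets(a, m);
    int *boff = fast ? NULL : build_offsets(b, n);
    int *prev = malloc((n + 1) * sizeof(int));
    int *curr = malloc((n + 1) * sizeof(int));

    assert(prev != NULL && curr != NULL);
    for (j = 0; j <= n; j++) prev[j] = j;
    for (i = 1; i <= m; i++) {
        int *tmp;

        curr[0] = i;
        for (j = 1; j <= n; j++) {
            int same, best;

            if (fast) {
                /* Fast path: character i is simply a[i]. */
                same = (a[i - 1] == b[j - 1]);
            } else {
                /* Slow path: compare length[i] bytes at &x[offset[i]]. */
                int la = aoff[i] - aoff[i - 1];
                int lb = boff[j] - boff[j - 1];
                same = (la == lb &&
                        memcmp(a + aoff[i - 1], b + boff[j - 1], la) == 0);
            }
            best = prev[j - 1] + (same ? 0 : 1);            /* substitute */
            if (prev[j] + 1 < best) best = prev[j] + 1;     /* delete */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1; /* insert */
            curr[j] = best;
        }
        tmp = prev; prev = curr; curr = tmp;
    }
    result = prev[n];
    free(prev); free(curr); free(aoff); free(boff);
    return result;
}
```

The fast path is chosen per call from the contents of the two strings,
not from the encoding, so pure-ASCII input in a UTF-8 database never
pays for the offset tables. Whether that trade-off suits a generalized
edit function is something only benchmarking will tell.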

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
