Re: [PATCH] Completed unaccent dictionary with many missing characters

From	Przemysław Sztoch
Subject	Re: [PATCH] Completed unaccent dictionary with many missing characters
Date	July 5, 2022, 19:24:49
Msg-id	4c9326a1-6554-262f-1f22-e636933086ed@sztoch.pl
In reply to	Re: [PATCH] Completed unaccent dictionary with many missing characters (Michael Paquier <michael@paquier.xyz>)
Responses	Re: [PATCH] Completed unaccent dictionary with many missing characters Re: [PATCH] Completed unaccent dictionary with many missing characters
List	pgsql-hackers

Michael Paquier wrote on 7/5/2022 9:22 AM:

> On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote:
>> Well, the addition of cyrillic does not make necessary the removal of
>> SOUND RECORDING COPYRIGHT or the DEGREEs, that implies the use of a
>> dictionary when manipulating the set of codepoints, but that's me
>> being too picky.  Just to say that I am fine with what you are
>> proposing here.
>
> So, I have been looking at the change for cyrillic letters, and are
> you sure that the range of codepoints [U+0410,U+044f] is right when it
> comes to consider all those letters as plain letters?  There are a
> couple of characters that itch me a bit with this range:
> - What of the letter CAPITAL SHORT I (U+0419) and SMALL SHORT I
> (U+0439)?  Shouldn't U+0439 be translated to U+0438 and U+0419
> translated to U+0418?  That's what I get while looking at
> UnicodeData.txt, and it would mean that the range of plain letters
> should not include both of them.

1. Good catch; I had missed that. It does not affect the generated rule list, though.
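As an illustration of the point about SHORT I, the canonical decomposition can be read from Python's bundled Unicode database. This is only a minimal sketch: `base_letter` is a hypothetical helper, not a function from the patch or from the rule generator.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Return the first codepoint of the canonical decomposition of ch.

    Characters without a decomposition, or with only a compatibility
    decomposition (tagged like "<font>"), are returned unchanged.
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return ch
    return chr(int(decomp.split()[0], 16))


# Show where SHORT I (capital and small) would be mapped.
for cp in (0x0419, 0x0439):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} -> U+{ord(base_letter(ch)):04X}")
```

Both letters decompose into the corresponding plain I followed by a combining breve (U+0306), which is why they should not be listed as plain letters themselves.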

> - It seems like we are missing a couple of letters after U+044F, like
> U+0454, U+0456 or U+0455 just to name three of them?

2. I added a few more letters that are used in languages other than Russian, such as Belarusian and Ukrainian:

-                       (0x0410, 0x044f),      # Cyrillic capital and small letters
+                       (0x0402, 0x0402),      # Cyrillic capital and small letters
+                       (0x0404, 0x0406),      #
+                       (0x0408, 0x040b),      #
+                       (0x040f, 0x0418),      #
+                       (0x041a, 0x0438),      #
+                       (0x043a, 0x044f),      #
+                       (0x0452, 0x0452),      #
+                       (0x0454, 0x0456),      #

I did not add more, because the remaining letters are probably used only in older languages.
An alternative might be to rely entirely on Unicode decomposition...
However, after this change, only one additional accented Ukrainian letter was added to the rule file.
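To make the effect of the new ranges easier to check, here is a rough sketch that lists the Cyrillic codepoints outside the ranges whose canonical decomposition starts with a letter inside them. The names `CYRILLIC_PLAIN_RANGES`, `is_plain_cyrillic` and `accented_candidates` are made up for this example, and the logic is a simplification, not the actual logic of the rule generator (which also consults Latin-ASCII.xml and checks the combining marks).

```python
import unicodedata

# Ranges copied from the diff above (inclusive bounds).
CYRILLIC_PLAIN_RANGES = [
    (0x0402, 0x0402),
    (0x0404, 0x0406),
    (0x0408, 0x040B),
    (0x040F, 0x0418),
    (0x041A, 0x0438),
    (0x043A, 0x044F),
    (0x0452, 0x0452),
    (0x0454, 0x0456),
]


def is_plain_cyrillic(cp: int) -> bool:
    """True if the codepoint falls inside one of the plain-letter ranges."""
    return any(lo <= cp <= hi for lo, hi in CYRILLIC_PLAIN_RANGES)


def accented_candidates(start: int, end: int) -> list[tuple[int, int]]:
    """Return (codepoint, base) pairs for codepoints in [start, end] that are
    not plain letters but canonically decompose to a plain letter plus marks."""
    found = []
    for cp in range(start, end + 1):
        if is_plain_cyrillic(cp):
            continue
        parts = unicodedata.decomposition(chr(cp)).split()
        if not parts or parts[0].startswith("<"):
            continue  # no decomposition, or compatibility-only
        base = int(parts[0], 16)
        if is_plain_cyrillic(base):
            found.append((cp, base))
    return found


for cp, base in accented_candidates(0x0400, 0x045F):
    print(f"U+{cp:04X} -> U+{base:04X}")
```

With these ranges, U+0419 and U+0439 are no longer treated as plain letters, and U+0457 (Ukrainian YI) becomes a candidate because its base U+0456 is now in a range.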


> I have extracted from 0001 and applied the parts about the regression
> tests for degree signs, while adding two more for SOUND RECORDING
> COPYRIGHT (U+2117) and Black-Letter Capital H (U+210C) translated to
> 'x', while it should be probably 'H'.

3. The matter is not that simple. When I change the priorities (i.e., make Latin-ASCII.xml less important than Unicode decomposition),
then U+33D7 is translated to PH instead of pH.
In the end, I left it as it was before...
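For reference, the Unicode side of this trade-off can be reproduced with Python's standard `unicodedata` module. This is just a sketch of what compatibility decomposition yields for the characters mentioned in this thread; `compat_form` is a hypothetical helper and says nothing about what Latin-ASCII.xml contains.

```python
import unicodedata


def compat_form(ch: str) -> str:
    """Return the NFKD (compatibility) decomposition of a single character."""
    return unicodedata.normalize("NFKD", ch)


# U+33D7 SQUARE PH, U+210C BLACK-LETTER CAPITAL H, U+2117 SOUND RECORDING COPYRIGHT
for cp in (0x33D7, 0x210C, 0x2117):
    ch = chr(cp)
    raw = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{cp:04X} {unicodedata.name(ch)}: decomposition={raw}, NFKD={compat_form(ch)!r}")
```

Unicode decomposition gives upper-case "PH" for U+33D7 and "H" for U+210C, while U+2117 has no decomposition at all, so it can only be handled through another source or an explicit rule.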

Once you decide what to do about point 3, I will correct it and send new patches.

--
Przemysław Sztoch | Mobile +48 509 99 00 66
