Re: [PATCH] Completed unaccent dictionary with many missing characters

Поиск
Список
Период
Сортировка
От Przemysław Sztoch
Тема Re: [PATCH] Completed unaccent dictionary with many missing characters
Дата
Msg-id d0e93206-d7cd-37b8-a79b-7ddf73457e02@sztoch.pl
обсуждение исходный текст
Ответ на Re: [PATCH] Completed unaccent dictionary with many missing characters  (Michael Paquier <michael@paquier.xyz>)
Ответы Re: [PATCH] Completed unaccent dictionary with many missing characters  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers


Michael Paquier wrote on 20.06.2022 03:49:
On Wed, Jun 15, 2022 at 01:01:37PM +0200, Przemysław Sztoch wrote:
Two fixes (bad comment and fixed Latin-ASCII.xml).
         if codepoint.general_category.startswith('L') and \
-           len(codepoint.combining_ids) > 1:
+           len(codepoint.combining_ids) > 0:
So, this one checks for the case where a codepoint is within the
letter category.  As far as I can see this indeed adds a couple of
characters, with a combination of Greek and Latin letters.  So that
looks fine.
Previously, there were only multi-letter conversions. Now we also have single letters.

+        elif codepoint.general_category.startswith('N') and \
+           len(codepoint.combining_ids) > 0 and \
+           args.noLigaturesExpansion is False and is_ligature(codepoint, table):
+            charactersSet.add((codepoint.id,
+                               "".join(chr(combining_codepoint.id)
+                                       for combining_codepoint
+                                       in get_plain_letters(codepoint, table))))
And this one is for the numerical part of the change.  Do you actually
need to apply is_ligature() here?  I would have thought that this only
applies to letters.
But ligature check is performed on combining_ids (result of translation), not on base codepoint.
Without it, you will get assertions in get_plain_letters.

The idea is that we take translations that turn into normal letters. Others (strange) are rejected.
Maybe it could be done better. I didn't like it as much as you did, but I couldn't do better.
In the end, I left it just like in the original script.

Note that the plain letter list (PLAIN_LETTER_RANGES) has now been expanded with numbers.
-    assert(False)
+    assert False, 'Codepoint U+%0.2X' % codepoint.id
[...]
-    assert(is_ligature(codepoint, table))
+    assert is_ligature(codepoint, table), 'Codepoint U+%0.2X' % codepoint.id
These two are a good idea for debugging.

-    return all(is_letter(table[i], table) for i in codepoint.combining_ids)
+    return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids)
It looks like this makes the code weaker, as we would silently skip
characters that are not part of the table rather than checking for
them all the time?
Unfortunately, there are entries in combining_ids that are not in the character table being used.
This protection is necessary so that there is no error. But unfamiliar characters are omitted.
While recreating unaccent.rules with your patch, I have noticed what
looks like an error.  An extra rule mapping U+210C (black-letter
capital h) to "x" gets added on top of te existing one for "H", but
the correct answer is the existing rule, not the one added by the
patch.
The problem with the sign of U+210C is that there are conflicting translations for it.
As the name suggests "(black-letter capital h)", it should be converted to a capital H.
However, the current Latin-ASCII.xml suggests a conversion to x.
I found an open discussion on the internet about this and the suggestion that the Latin-ASCII.xml file should be corrected for this letter.
But I wouldn't expect that Unicode makes the revised Latin-ASCII.xml quickly into the official repo.

--
Przemysław Sztoch | Mobile +48 509 99 00 66

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Следующее
От: Matthias van de Meent
Дата:
Сообщение: Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths