Re: Improve the performance of Unicode Normalization Forms.
From | Jeff Davis |
---|---|
Subject | Re: Improve the performance of Unicode Normalization Forms. |
Date | |
Msg-id | f3fee95b221846b7ab6b08ddfad039518c6aa97b.camel@j-davis.com |
In reply to | Re: Improve the performance of Unicode Normalization Forms. (Alexander Borisov <lex.borisov@gmail.com>) |
Responses | Re: Improve the performance of Unicode Normalization Forms. |
List | pgsql-hackers |
On Mon, 2025-08-11 at 17:21 +0300, Alexander Borisov wrote:
> As a result, I would not look into ICU at the moment, especially since
> we have a different approach.
> I am currently working on optimizing unicode_normalize().
> I am trying to come up with an improved version of the algorithm in C
> by the next commitfest (which will be in September).

Agreed, but thank you for adding context so I can understand where we are.

The patch as proposed is a speedup, and also a simplification because it eliminates the different code path for the frontend code. That also makes me feel better about testing, because I don't think both those paths were tested equally.

Comments on the patch itself:

The 0001 patch generalizes the two-step lookup process: first navigate branches to find the index into a partially-compacted sparse array, and then use that to get the index into the dense array. The branching code, the sparse array, and the dense array are all generated code. The reason for the two-step lookup is to keep the sparse array element size small (uint16), so that missing elements consume less space (missing elements still remain for small gaps). The full entry is kept in the dense array.

GenerateSparseArray.pm would be more descriptive than "Ranges.pm" for the new module. And we should probably include "sparse" in the name of the sparse arrays.

The new module is responsible for generating the branching code as well as the sparse array, while the caller is responsible for the dense arrays. For case mapping, one sparse array is used for four parallel arrays for the different case kinds (lower/title/upper/fold).

The use of zero values for different purposes is getting confusing. It means "doesn't exist", but we are also reserving the zeroth element in the arrays. Would it be easier to just "#define EMPTY 0xFFFF" and then have the caller check for it? That way we don't need to reserve the zeroth array element, which should make it easier to avoid off-by-one errors.

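To make the two-step lookup and the EMPTY suggestion concrete, here is a minimal sketch in C. It is not taken from the patch: the names `DenseEntry`, `sparse_index()` and `lookup()`, the table contents, and the hand-written branches are all made up for illustration, and in the patch the branching code and both arrays are generated. It also assumes the sparse array has fewer than 0xFFFF elements, so that EMPTY can double as "no branch matched".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY 0xFFFF            /* sentinel for "doesn't exist" */

/* Hypothetical full entry; the real dense entries hold more than this. */
typedef struct
{
    uint32_t    codepoint;
    uint32_t    lower;
} DenseEntry;

/* Dense array: one element per code point that has data, no gaps. */
static const DenseEntry dense[] = {
    {0x0041, 0x0061},           /* LATIN CAPITAL LETTER A */
    {0x0042, 0x0062},           /* LATIN CAPITAL LETTER B */
    {0x0044, 0x0064},           /* LATIN CAPITAL LETTER D */
    {0x0410, 0x0430},           /* CYRILLIC CAPITAL LETTER A */
};

#define DENSE_LEN (sizeof(dense) / sizeof(dense[0]))

/*
 * Sparse array: small uint16 elements, so the gap left at U+0043 costs
 * two bytes rather than a full DenseEntry. Element 0 is a real entry;
 * nothing is reserved.
 */
static const uint16_t sparse[] = {
    0, 1, EMPTY, 2,             /* U+0041..U+0044 */
    3,                          /* U+0410 */
};

/* Step 1: branches map a code point to an index into sparse[]. */
static uint16_t
sparse_index(uint32_t cp)
{
    if (cp >= 0x0041 && cp <= 0x0044)
        return (uint16_t) (cp - 0x0041);
    if (cp == 0x0410)
        return 4;
    return EMPTY;               /* outside every range */
}

/* Step 2: sparse[] yields the index into dense[], or EMPTY. */
static const DenseEntry *
lookup(uint32_t cp)
{
    uint16_t    si = sparse_index(cp);
    uint16_t    di;

    if (si == EMPTY)
        return NULL;
    di = sparse[si];
    if (di == EMPTY)
        return NULL;            /* small gap inside a range */
    assert(di < DENSE_LEN);
    return &dense[di];
}
```

With these made-up tables, `lookup(0x0041)` returns the entry for "A" even though it sits at index 0 of both arrays, while `lookup(0x0043)` (a gap inside a range) and `lookup(0x0045)` (outside every range) both return NULL. The caller only ever compares against EMPTY, so zero stays an ordinary index.
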
I think we can simplify the interface, as well. Why does the caller need to separately generate the ranges, then generate the table, then generate the branches? It's really all the same action and can be based on an input hash with a certain structure, and then return both the table and the branches, right?

Regards,
	Jeff Davis