Re: Improve the performance of Unicode Normalization Forms.
From | Alexander Borisov |
---|---|
Subject | Re: Improve the performance of Unicode Normalization Forms. |
Date | |
Msg-id | 16f87504-f174-450e-93cc-2db1074522bb@gmail.com |
In reply to | Re: Improve the performance of Unicode Normalization Forms. (Jeff Davis <pgsql@j-davis.com>) |
Responses | Re: Improve the performance of Unicode Normalization Forms. |
List | pgsql-hackers |
19.06.2025 20:41, Jeff Davis wrote:
> On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:
>> As promised, I continue to improve/speed up Unicode in Postgres.
>> Last time, we improved the lower(), upper(), and casefold()
>> functions. [1]
>> Now it's time for Unicode Normalization Forms, specifically
>> the normalize() function.
>
> Did you compare against other implementations, such as ICU's
> normalization functions? There's also a rust crate here:
>
> https://github.com/unicode-rs/unicode-normalization
>
> that might have been optimized.

I don't quite see how this compares to the Rust implementation. The
crate at that link uses a perfect hash, which is what I get rid of in
my patches to get a 2x speedup.
As for the ICU implementations in C++, I have always considered them
slow, at least when used from C code.
I may well run benchmarks and compare the performance of the approach
in Postgres with ICU, but that is beyond the scope of the patches
under discussion.

I want to emphasize that the patches I sent don't change the
normalization code/logic. They change how the right codepoints are
looked up in the tables, which is what gives us the performance boost.

> In addition to the lookups themselves, there are other opportunities
> for optimization as well, such as:
>
> * reducing the need for palloc and extra buffers, perhaps by using
> buffers on the stack for small strings
>
> * operate more directly on UTF-8 data rather than decoding and
> re-encoding the entire string

I absolutely agree with you: the normalization code is very well
written but far from optimized. I didn't send changes to the
normalization code itself, to avoid lumping everything together and to
make the review easier.
In keeping with my plan for optimizing the normalization forms, I
intended to discuss optimizing the normalization C code in the next
iteration of “Improve performance...”.

--
Regards,
Alexander Borisov
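
For readers following the thread, the quoted suggestion about using stack buffers for small strings could look roughly like the sketch below. This is an illustration only, not code from the patches or from Postgres: `STACK_CODEPOINTS`, `decode_utf8()` and `count_codepoints()` are hypothetical names, plain `malloc` stands in for `palloc`, and the input is assumed to be valid UTF-8.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold; a real value would need benchmarking. */
#define STACK_CODEPOINTS 64

/*
 * Decode one UTF-8 sequence into *cp and return the number of bytes
 * consumed. Assumes the input is already validated UTF-8.
 */
static int
decode_utf8(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80)
    {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)
    {
        *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0)
    {
        *cp = ((uint32_t) (s[0] & 0x0F) << 12) |
              ((uint32_t) (s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t) (s[0] & 0x07) << 18) |
          ((uint32_t) (s[1] & 0x3F) << 12) |
          ((uint32_t) (s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/*
 * Decode str into a code point buffer and return the number of code
 * points. Short strings use a stack buffer; only longer ones allocate.
 * *used_heap reports which path was taken (for demonstration only).
 */
size_t
count_codepoints(const char *str, int *used_heap)
{
    uint32_t    stack_buf[STACK_CODEPOINTS];
    uint32_t   *buf = stack_buf;
    size_t      len = strlen(str);
    size_t      n = 0;

    *used_heap = 0;

    /* len bytes of UTF-8 hold at most len code points. */
    if (len > STACK_CODEPOINTS)
    {
        buf = malloc(len * sizeof(uint32_t));
        if (buf == NULL)
            return (size_t) -1;
        *used_heap = 1;
    }

    for (size_t i = 0; i < len;)
        i += decode_utf8((const unsigned char *) str + i, &buf[n++]);

    /* Normalization would operate on buf[0..n) at this point. */

    if (buf != stack_buf)
        free(buf);
    return n;
}
```

The point of the sketch is only that the common case of short strings skips the allocator entirely; whether that pays off in the actual normalization code is something the benchmarks discussed in this thread would have to show.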