Improve the performance of Unicode Normalization Forms.
От | Alexander Borisov |
---|---|
Тема | Improve the performance of Unicode Normalization Forms. |
Дата | |
Msg-id | 844d3dd7-2955-4794-95d1-7f4c13cb89fc@gmail.com обсуждение исходный текст |
Список | pgsql-hackers |
Hi, hackers! As promised, I continue to improve/speed up Unicode in Postgres. Last time, we improved the lower(), upper(), and casefold() functions. [1] Now it's time for Unicode Normalization Forms, specifically the normalize() function. The attachment contains three patches: 1. Transfer of functions for building a range index to a separate Perl module. 2-small. In this patch, the mapping tables are reduced, but this increases the number of branches for searching for the desired value. 2-big. In this patch, the tables are large, but the number of branches for searching for the desired value is small. In other words, we choose between two patches, either small or big. What did we manage to achieve? 1. The uint32 field, which stored the unicode codepoint record, was removed from the main structure/table. This was necessary for perfect hashing. 2. Thanks to the first point, we managed to get rid of duplicates and reduce the main table from 6843 to 3943. 3. We managed to get rid of duplicates in the table with codepoints for decomposition from 5138 to 4304 (uint32). 4. These manipulations allowed us to increase the speed by up to 40% in some cases (benchmarks below). 5. The server part "lost weight" in the binary, but the frontend "gained weight" a little. I read the old commits, which say that the size of the frontend is very important and that speed is not important (speed is important on the server). I'm not quite sure what to do if this is really the case. Perhaps we should leave the slow version for the frontend. Benchmarks. How was it tested? Four files were created for each normalization form: NFC, NFD, NFKC, and NFKD. The files were sent via pgbench. The files contain all code points that need to be normalized. Unfortunately, the patches are already quite large, but if necessary, I can send these files in a separate email or upload them somewhere. Legend. Patch (big table) — this is a bloated mapping table, but with a small number of branches for searching. Patch (small table) — these are small tables, but with a large number of branches for searching. Without — the original Postgres without patches. * Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0) NFC: Patch (big table): tps = 5057.624509 Patch (small table): tps = 4727.158048 Without: tps = 3268.654867 NFD: Patch (big table): tps = 1145.730247 Patch (small table): tps = 1073.084633 Without: tps = 689.627239 NFKC: Patch (big table): tps = 4821.700702 Patch (small table): tps = 4662.640629 Without: tps = 3450.254581 NFKD: Patch (big table): tps = 1142.644341 Patch (small table): tps = 951.405893 Without: tps = 636.161496 How the size of object files and libraries has changed (in bytes): Patch (big table): unicode_norm.o: 138896 unicode_norm_srv.o: 191336 libpq.a: 364782 libpq.so.5.18: 480944 libpgcommon.a: 584432 libpgcommon_shlib.a: 568724 Patch (small table): unicode_norm.o: 86416 unicode_norm_srv.o: 138824 libpq.a: 364782 libpq.so.5.18: 423600 libpgcommon.a: 531952 libpgcommon_shlib.a: 516244 Without: unicode_norm.o: 79488 unicode_norm_srv.o: 165504 libpq.a: 364782 libpq.so.5.18: 419288 libpgcommon.a: 525024 libpgcommon_shlib.a: 509316 [1] https://www.postgresql.org/message-id/flat/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com -- Best regards, Alexander Borisov
Вложения
В списке pgsql-hackers по дате отправления: