Re: refactor architecture-specific popcount code
| От | Nathan Bossart |
|---|---|
| Тема | Re: refactor architecture-specific popcount code |
| Дата | |
| Msg-id | aXJjbkp2_glyfy6z@nathan обсуждение исходный текст |
| Ответ на | Re: refactor architecture-specific popcount code (John Naylor <johncnaylorls@gmail.com>) |
| Ответы |
Re: refactor architecture-specific popcount code
|
| Список | pgsql-hackers |
On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote: > 1) Nowadays, the only global call sites of the word-sized functions > are select_best_grantor() and in bitmapsets. The latter calls the > word-sized functions in a loop (could be just one word). It may be > more efficient to calculate the size in bytes and call pg_popcount(). Yeah, these seem like obvious places to use pg_popcount(). Note that bms_member_index() does a final popcount on a masked version of the last word. We could swap that with pg_popcount(), too, but it might be slower than just calling the word-sized function. However, it could be hard to tell the difference, as we'd be trading a function or function pointer call with an inlined loop over pg_number_of_ones. And even if it is slower, I'm not sure it matters all that much in the grand scheme of things. In any case, 0001 gets the easy ones out of the way. > Then we could get rid of all the pointer indirection for the > word-sized functions. Do you mean that we'd just keep the portable ones around? I see some code in pgvector that might be negatively impacted by that, but if I understand correctly it would require an unusual setup. > 2) The x86 byte buffer variants expend a lot of effort to detect > whether the buffer is aligned on both 64- and 32-bit platforms, with > an optimized path for each. At least 64-bit doesn't care about > alignment, and 32-bit doesn't warrant anything fancier than pure C. > Simultaneously, the aarch64 equivalent doesn't seem to take care about > alignment. (I think Nathan mentioned he didn't see a difference during > testing, but I wonder how universal that is). 0002 makes these changes for pg_popcount_sse42() and pg_popcount_masked_sse42(). It does seem strange to prefer a loop over pg_number_of_ones instead of using POPCNTQ when unaligned, but perhaps it's worth testing. I do recall the alignment stuff in the AVX-512 code showing benefits in tests because it avoids double-load overhead, so I think we should keep that for now. > 3) There is repeated code for the <8 bytes case, and the tail of the > "optimized" functions. I'm also not sure why the small case is inlined > everywhere. This is intended to help avoid function call and SIMD instruction overhead when it doesn't make sense to take it. I recall this showing a rather big difference in benchmarks when we were working on the AVX-512 versions. Regarding the duplicated code, sure, we could add some static inline functions or something. I think the only reason I haven't done so is because it's ~2 lines of code. -- nathan
Вложения
В списке pgsql-hackers по дате отправления: