introduce optimized linear search functions that return index of matching element

Поиск
Список
Период
Сортировка
От Nathan Bossart
Тема introduce optimized linear search functions that return index of matching element
Дата
Msg-id 20220917052903.GA3172400@nathanxps13
обсуждение исходный текст
Ответы Re: introduce optimized linear search functions that return index of matching element
Список pgsql-hackers
On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote:
> v6 demonstrates why this should have been put off towards the end. (more below)

Since the SIMD code is fresh in my mind, I wanted to offer my review for
0001 in the "Improve dead tuple storage for lazy vacuum" thread [0].
However, I agree with John that the SIMD part of that work should be left
for the end, and I didn't want to distract from the radix tree part too
much.  So, here is a new thread for just the SIMD part.

>> I've updated the radix tree patch. It's now separated into two patches.
>>
>> 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
>> better names) that are similar to the pg_lfind8() family but they
>> return the index of the key in the vector instead of true/false. The
>> patch includes regression tests.

I don't think it's clear that the "lfind" functions return whether there is
a match while the "lsearch" functions return the index of the first match.
It might be better to call these something like "pg_lfind8_idx" and
"pg_lfind8_ge_idx" instead.

> +/*
> + * Return the index of the first element in the vector that is greater than
> + * or eual to the given scalar. Return sizeof(Vector8) if there is no such
> + * element.
>
> That's a bizarre API to indicate non-existence.

+1.  It should probably just return -1 in that case.

> + *
> + * Note that this function assumes the elements in the vector are sorted.
> + */
>
> That is *completely* unacceptable for a general-purpose function.

+1

> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
>
> I don't think we should try to force the non-simd case to adopt the
> special semantics of vector comparisons. It's much easier to just use
> the same logic as the assert builds.

+1

> +#ifdef USE_SSE2
> + return (uint32) _mm_movemask_epi8(v);
> +#elif defined(USE_NEON)
> + static const uint8 mask[16] = {
> +        1 << 0, 1 << 1, 1 << 2, 1 << 3,
> +        1 << 4, 1 << 5, 1 << 6, 1 << 7,
> +        1 << 0, 1 << 1, 1 << 2, 1 << 3,
> +        1 << 4, 1 << 5, 1 << 6, 1 << 7,
> +      };
> +
> +    uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
> vshrq_n_s8(v, 7));
> +    uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
> +
> +    return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
>
> For Arm, we need to be careful here. This article goes into a lot of
> detail for this situation:
>
>
https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon

The technique demonstrated in this article seems to work nicely.

For these kinds of patches, I find the best way to review them is to try
out my proposed changes as I'm reading through the patch.  I hope you don't
mind that I've done so here and attached a new version of the patch.  In
addition to addressing the aforementioned feedback, I made the following
changes:

* I renamed the vector8_search_* functions to vector8_find() and
vector8_find_ge().  IMO this is more in the spirit of existing function
names like vector8_has().

* I simplified vector8_find_ge() by essentially making it do the opposite
of what vector8_has_le() does (i.e., using saturating subtraction to find
matching bytes).  This removes the need for vector8_min(), and since
vector8_find_ge() can just call vector8_search() to find any 0 bytes,
vector8_highbit_mask() can be removed as well.

* I simplified the test for pg_lfind8_ge_idx() by making it look a little
more like the test for pg_lfind32().  I wasn't sure about the use of rand()
and qsort(), and overall it just felt a little too complicated to me.

I've tested all three code paths (i.e., SSE2, Neon, and USE_NO_SIMD), but I
haven't done any performance analysis yet.

[0] https://postgr.es/m/CAD21AoD3w76wERs_Lq7_uA6%2BgTaoOERPji%2BYz8Ac6aui4JwvTg%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: ICU for global collation
Следующее
От: Marina Polyakova
Дата:
Сообщение: Re: ICU for global collation