Thanks for looking at this.
On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've checked for Clang 6, it turns out that indeed it generates popcnt without
> any macro, but only in one place for bloom_prop_bits_set. After looking at this
> function it seems that it would be beneficial to actually use popcnt there too.
Yeah, that's the pattern that's mentioned in
https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
It would need to be changed to call the popcount function. The
existence of this pattern makes me a bit more worried that some
extension could be using a similar one and end up being compiled with
-mpopcnt due to pg_config having that CFLAG. That's all fine until the
binary makes its way over to a machine without that instruction.
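
For illustration, here is a minimal sketch of the kind of loop the blog
post describes, next to a version that calls a popcount builtin
directly. The function names and the byte-buffer layout are
hypothetical, not the actual bloom filter code, and
__builtin_popcount is assumed to be available (GCC and Clang).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count set bits by repeatedly clearing the lowest
 * set bit. Some compilers recognize this idiom and may emit a popcnt
 * instruction for it when the target flags allow that.
 */
static uint64_t
count_bits_loop(const unsigned char *buf, size_t len)
{
    uint64_t    bits_set = 0;

    for (size_t i = 0; i < len; i++)
    {
        unsigned char byte = buf[i];

        while (byte)
        {
            bits_set++;
            byte &= (unsigned char) (byte - 1); /* clear lowest set bit */
        }
    }
    return bits_set;
}

/*
 * Hypothetical helper: same result, but asking for a popcount explicitly
 * through the compiler builtin (an assumption: GCC or Clang).
 */
static uint64_t
count_bits_builtin(const unsigned char *buf, size_t len)
{
    uint64_t    bits_set = 0;

    for (size_t i = 0; i < len; i++)
        bits_set += (uint64_t) __builtin_popcount(buf[i]);
    return bits_set;
}
```

Both return the same count for any buffer. Whether either one ends up
as a popcnt instruction depends on the compiler and the flags in use,
which is exactly the concern with extensions picking up -mpopcnt.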
> > I am able to measure performance gains from the patch. In a 3.4GB
> > table containing a single column with just 10 statistics targets, I
> > got the following times after running ANALYZE on the table.
>
> I've tested it a bit too, and got similar results, with the patched version
> slightly faster. But then I wonder if popcnt is the best solution here, since
> after some short research I found a paper [1], where authors claim that:
>
> Maybe surprisingly, we show that a vectorized approach using SIMD
> instructions can be twice as fast as using the dedicated instructions on
> recent Intel processors.
>
>
> [1]: https://arxiv.org/pdf/1611.07612.pdf
I can't imagine that using the number_of_ones[] array, processing
8 bits at a time, would be slower than POPCNT though.
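
To make the comparison concrete, here is a rough sketch of the two
approaches for a 64-bit word. The table ones_in_byte[] is a
hypothetical stand-in for number_of_ones[] and is filled at run time
here only to keep the example short. __builtin_popcountll is assumed
to be available (GCC and Clang). The sketch shows only that the two
give the same answer; it says nothing about relative speed, which
would have to be measured.

```c
#include <stdint.h>

/* Hypothetical stand-in for number_of_ones[]: set bits per byte value. */
static uint8_t ones_in_byte[256];

/* Fill the table: bits(i) = bits(i / 2) + lowest bit of i. */
static void
init_ones_in_byte(void)
{
    for (int i = 1; i < 256; i++)
        ones_in_byte[i] = (uint8_t) (ones_in_byte[i >> 1] + (i & 1));
}

/* Table-driven popcount, consuming the word 8 bits at a time. */
static int
popcount64_table(uint64_t word)
{
    int         result = 0;

    while (word != 0)
    {
        result += ones_in_byte[word & 0xFF];
        word >>= 8;
    }
    return result;
}

/* Builtin popcount; may compile to POPCNT if the target flags allow it. */
static int
popcount64_builtin(uint64_t word)
{
    return __builtin_popcountll(word);
}
```

init_ones_in_byte() has to be called once before popcount64_table()
is used.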
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services