On Tue, Nov 28, 2017 at 9:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Nov 28, 2017 at 9:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>> Would that address your concern? There would be an SQL interface, but
>>> it would be trivial.
>>
>> That's exactly what I think you should do, and mentioned so upthread.
>> A SQL interface can also show a good example of how developers can use
>> this API.
Attach revision, v5, adds a new test harness -- test_bloomfilter.
This can be used to experimentally verify that the meets the well
known "1% false positive rate with 9.6 bits per element" standard. It
manages to do exactly that:
postgres=# set client_min_messages = 'debug1';
SET
postgres=# SELECT test_bloomfilter(power => 23, nelements => 873813,
seed => -1, tests => 3);
DEBUG: beginning test #1...
DEBUG: bloom_work_mem (KB): 1024
DEBUG: false positives: 8630 (rate: 0.009876, proportion bits set:
0.517625, seed: 1373191603)
DEBUG: beginning test #2...
DEBUG: bloom_work_mem (KB): 1024
DEBUG: false positives: 8623 (rate: 0.009868, proportion bits set:
0.517623, seed: 406665822)
DEBUG: beginning test #3...
DEBUG: bloom_work_mem (KB): 1024
WARNING: false positives: 8840 (rate: 0.010117, proportion bits set:
0.517748, seed: 398116374)
test_bloomfilter
------------------
(1 row)
Here, we repeat the same test 3 times, varying only the seed value
used for each run.
The last message is a WARNING because we exceed the 1% threshold
(hard-coded into test_bloomfilter.c), though only by a tiny margin,
due only to random variations in seed value. We round up to 10 bits
per element for the regression tests. That's where the *actual*
"nelements" argument comes from within the tests, so pg_regress tests
should never see the WARNING (if they do, that counts as a failure).
I've experimentally observed that we get the 1% false positive rate
with any possible bitset when "nelements" works out at 9.6 bitset bits
per element. Inter-run variation is tiny. With 50 tests, I didn't
observe these same Bloom filter parameters produce a false positive
rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
got.
There is a fairly extensive README, which I hope will clear up the
theory behind the bloomfilter.c strategy on bitset size and false
positives. Also, there was a regression that I had to fix in
bloomfilter.c, in seeding. It didn't reliably cause variation in the
false positives. And, there was bitrot with the documentation that I
fixed up.
--
Peter Geoghegan