Re: [HACKERS] A design for amcheck heapam verification

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: [HACKERS] A design for amcheck heapam verification
Дата
Msg-id CAB7nPqT_oNzUKcrV_Pb7m7FZ2_D88++F1a0DnaDiW3PqxF-8-Q@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] A design for amcheck heapam verification  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: [HACKERS] A design for amcheck heapam verification  (Thomas Munro <thomas.munro@enterprisedb.com>)
Список pgsql-hackers
On Thu, Sep 28, 2017 at 3:32 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Sep 27, 2017 at 1:45 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> I have signed up as a reviewer of this patch, and I have looked at the
>> bloom filter implementation for now. This is the kind of facility that
>> people have asked for on this list for many years.
>>
>> One first thing striking me is that there is no test for this
>> implementation, which would be a base stone for other things, it would
>> be nice to validate that things are working properly before moving on
>> with 0002, and 0001 is a feature on its own. I don't think that it
>> would be complicated to have a small module in src/test/modules which
>> plugs in a couple of SQL functions on top of bloomfilter.h.
>
> 0001 is just a library utility. None of the backend lib utilities
> (like HyperLogLog, Discrete knapsack, etc) have dedicated test
> frameworks. Coding up an SQL interface for these things is a
> non-trivial project in itself.
>
> I have testing the implementation myself, but that was something that
> used C code.
>
> Can you tell me what you have in mind here in detail? For example,
> should there be a custom datatype that encapsulates the current state
> of the bloom filter? Or, should there be an aggregate function, that
> takes a containment argument that is tested at the end?

That could be something very simple:
- A function to initiate a bloom filter in a session context, with a
number of elements in input, which uses for example integers.
- A function to add an integer element to it.
- A function to query if an integer may exist or not.
- A function to free it.
The point is just to stress this code, I don't think that it is a
project this "large".

>> +#define MAX_HASH_FUNCS 10
>> Being able to define the number of hash functions used at
>> initialization would be nicer. Usually this number is kept way lower
>> than the number of elements to check as part of a set, but I see no
>> reason to not allow people to play with this API in a more extended
>> way. You can then let your module decide what it wants to use.
>
> The number of hash functions used is itself a function of the amount
> of memory available, and an estimate of the overall size of the set
> made ahead of time (in the case of amcheck, this is
> pg_class.reltuples). The typical interface is that the caller either
> specifies the amount of memory, or the required false positive rate
> (either way, they must also provide that estimate).

+   filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
The estimate is from wikipedia-sensei:
https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
Being able to enforce that would be nice, not mandatory perhaps.

> The value of MAX_HASH_FUNCS, 10, was chosen based on the fact that we
> never actually use more than that many hash functions in practice,
> given the limitations on the total amount of memory you can use
> (128MB). The formula for determining the optimum number of hash
> functions is pretty standard stuff.

Hm... OK. That could be a default... Not really convinced though.

>> +/*
>> + * Hash function is taken from sdbm, a public-domain reimplementation of the
>> + * ndbm database library.
>> + */
>> Reference link?
>
> http://www.eecs.harvard.edu/margo/papers/usenix91/paper.pdf

That would be nicer if added in the code :)

> I'm probably going to end up using murmurhash32() instead of the sdbm
> hash function anyway, now that Andres has exposed it in a header file.
> This probably won't matter in the next version.

Yeah, that may be a good idea at the end.

>> +   bitset_bytes = bitset_bits / BITS_PER_BYTE;
>> +   filter->bitset = palloc0(bitset_bytes);
>> +   filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
>> +   filter->seed = seed;
>> I think that doing the allocation within the initialization phase is a
>> mistake. Being able to use a DSA would be nice, but I predict as well
>> cases where a module may want to have a bloom filter that persists as
>> well across multiple transactions, so the allocation should be able to
>> live across memory contexts.
>
> Why not just switch memory context before calling? Again, other
> comparable utilities don't have provide this in their interface.
>
> As for DSM, I think that that can come later, and can be written by
> somebody closer to that problem. There can be more than one
> initialization function.

I don't completely disagree with that, there could be multiple
initialization functions. Still, an advantage about designing things
right from the beginning with a set of correct APIs is that we don't
need extra things later and this will never bother module maintainers.
I would think that this utility interface should be minimal and
portable to maintain a long-term stance.

>> What I think you should do instead to
>> make this bloom implementation more modular is to let the caller give
>> a pointer to a memory area as well as its size. Then what bloom_init
>> should do is to just initialize this area of memory with zeros. This
>> approach would give a lot of freedom. Not linking a bloom definition
>> to work_mem would be nice as well.
>
> The implementation is probably always going to be bound in size by
> work_mem in practice, like tuplesort and tuplestore. I would say that
> that's a natural fit.

Hm...

>> +   hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
>> I am wondering if it would make sense to be able to enforce the hash
>> function being used. The default does not look bad to me though, so we
>> could live with that.
>
> I prefer to keep things simple. I'm not aware of any use case that
> calls for the user to use a custom hash function. That said, I could
> believe that someone would want to use their own hash value for each
> bloom_add_element(), when they have one close at hand anyway -- much
> like addHyperLogLog(). Again, that seems like work for the ultimate
> consumer of that functionality. It's a trivial tweak that can happen
> later.

Yeah, perhaps most people would be satisfied with having only a default.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Bossart, Nathan"
Дата:
Сообщение: Re: [HACKERS] [Proposal] Allow users to specify multiple tables inVACUUM commands
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: [HACKERS] A design for amcheck heapam verification