Обсуждение: Sparse bit set data structure

Поиск

Список

Период

Сортировка

Sparse bit set data structure

От

Heikki Linnakangas

Дата:

13 марта 2019 г., 22:18:12

Hi,

I was reviewing Andrey Borodin's patch for GiST VACUUM [1], which
includes a new "block set" data structure, to track internal and empty
pages while vacuuming a GiST. The blockset data structure was a pretty
simple radix tree, that can hold a set of BlockNumbers.

The memory usage of the radix tree would probably be good enough in real
life, as we also discussed on the thread. Nevertheless, I was somewhat
bothered by it, so I did some measurements. I added some
MemoryContextStats() calls to Andrey's test_blockset module, to print
out memory usage.

For storing 5000000 random 32-bit integers, or a density of about 1% of
bits set, the blockset consumed about 380 MB of memory. I think that's a
pretty realistic ratio of internal pages : leaf pages on a GiST index,
so I would hope the blockset to be efficient in that ballpark. However,
380 MB / 5000000 is about 76 bytes, so it's consuming about 76 bytes per
stored block number. That's a lot of overhead! For comparison, a plain
BlockNumber is just 4 bytes. With more sparse sets, it is even more
wasteful, on a per-item basis, although the total memory usage will of
course be smaller. (To be clear, no one is pushing around GiST indexes
with anywhere near 2^32 blocks, or 32 TB, but the per-BlockNumber stats
are nevertheless interesting.)

I started to consider rewriting the data structure into something more
like B-tree. Then I remembered that I wrote a data structure pretty much
like that last year already! We discussed that on the "Vacuum: allow
usage of more than 1GB of work mem" thread [2], to replace the current
huge array that holds the dead TIDs during vacuum.

So I dusted off that patch, and made it more general, so that it can be
used to store arbitrary 64-bit integers, rather than ItemPointers or
BlockNumbers. I then added a rudimentary form of compression to the leaf
pages, so that clusters of nearby values can be stored as an array of
32-bit integers, or as a bitmap. That would perhaps be overkill, if it
was just to conserve some memory in GiST vacuum, but I think this will
turn out to be a useful general-purpose facility.

I plugged this new "sparse bitset" implementation into the same
test_blockset test. The memory usage for 5000000 values is now just over
20 MB, or about 4.3 bytes per value. That's much more reasonable than
the 76 bytes.

I'll do some more performance testing on this, to make sure it performs
well enough on random lookups, to also replace VACUUM's dead item
pointer array. Assuming that works out, I plan to polish up and commit
this, and use it in the GiST vacuum. I'm also tempted to change VACUUM
to use this, since that should be pretty straightforward once we have
the data structure.

[1]
https://www.postgresql.org/message-id/A51F64E3-850D-4249-814E-54967103A859%40yandex-team.ru

[2]
https://www.postgresql.org/message-id/8e5cbf08-5dd8-466d-9271-562fc65f133f%40iki.fi

- Heikki

Вложения

Re: Sparse bit set data structure

От

Robert Haas

Дата:

13 марта 2019 г., 22:48:39

On Wed, Mar 13, 2019 at 3:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I started to consider rewriting the data structure into something more
> like B-tree. Then I remembered that I wrote a data structure pretty much
> like that last year already! We discussed that on the "Vacuum: allow
> usage of more than 1GB of work mem" thread [2], to replace the current
> huge array that holds the dead TIDs during vacuum.
>
> So I dusted off that patch, and made it more general, so that it can be
> used to store arbitrary 64-bit integers, rather than ItemPointers or
> BlockNumbers. I then added a rudimentary form of compression to the leaf
> pages, so that clusters of nearby values can be stored as an array of
> 32-bit integers, or as a bitmap. That would perhaps be overkill, if it
> was just to conserve some memory in GiST vacuum, but I think this will
> turn out to be a useful general-purpose facility.

Yeah, that sounds pretty cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Sparse bit set data structure

От

Andrey Borodin

Дата:

14 марта 2019 г., 08:15:58

Hi!

> 14 марта 2019 г., в 0:18, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
<0001-Add-SparseBitset-to-hold-a-large-set-of-64-bit-ints-.patch><0002-Andrey-Borodin-s-test_blockset-tool-adapted-for-Spar.patch>

That is very interesting idea. Basically, B-tree and radix tree is a tradeoff between space and time.

In general, lookup into radix tree will touch less CPU cache lines.
In this terms Bitmapset is on most performant and memory-wasteful side: lookup into Bitmapset is always 1 cache line.
Performance of radix tree can be good in case of skewed distribution, while B-tree will be OK on uniform. I think that
distributionof GiST inner pages is uniform, distribution of empty leafs is... I have no idea, let's consider uniform
too.

I'd review this data structure ASAP. I just need to understand: do we aim to v12 or v13? (I did not solve concurrency
issuesin GiST VACUUM yet, but will fix them at weekend) 

Also, maybe we should consider using RoaringBitmaps? [0]
As a side not I would add that while balanced trees are widely used for operations on external memory, there are more
performantversions for main memory. Like AVL-tree and RB-tree. 


Brest regards, Andrey Borodin.

[0] https://github.com/RoaringBitmap/CRoaring

Re: Sparse bit set data structure

От

Heikki Linnakangas

Дата:

14 марта 2019 г., 11:42:49

On 14/03/2019 07:15, Andrey Borodin wrote:
>> 14 марта 2019 г., в 0:18, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>>
<0001-Add-SparseBitset-to-hold-a-large-set-of-64-bit-ints-.patch><0002-Andrey-Borodin-s-test_blockset-tool-adapted-for-Spar.patch>
> 
> That is very interesting idea. Basically, B-tree and radix tree is a tradeoff between space and time.
> 
> In general, lookup into radix tree will touch less CPU cache lines.
> In this terms Bitmapset is on most performant and memory-wasteful side: lookup into Bitmapset is always 1 cache
line.
> Performance of radix tree can be good in case of skewed distribution, while B-tree will be OK on uniform. I think
thatdistribution of GiST inner pages is uniform, distribution of empty leafs is... I have no idea, let's consider
uniformtoo.

Yeah. In this implementation, the leaf nodes are packed into bitmaps 
when possible, so it should perform quite well on skewed distributions, too.

> I'd review this data structure ASAP. I just need to understand: do we aim to v12 or v13? (I did not solve concurrency
issuesin GiST VACUUM yet, but will fix them at weekend)

I'm aiming v12 with this. It's a fairly large patch, but it's very 
isolated. I think the most pressing issue is getting the rest of the 
GiST vacuum patch fixed. If you get that fixed over the weekend, I'll 
take another look at it on Monday.

> Also, maybe we should consider using RoaringBitmaps? [0]
> As a side not I would add that while balanced trees are widely used for operations on external memory, there are more
performantversions for main memory. Like AVL-tree and RB-tree.

Hmm. Yeah, this is quite similar to Roaring Bitmaps. Roaring bitmaps 
also have a top-level, at which you binary search, and "leaf" nodes 
which can be bitmaps or arrays. In a roaring bitmap, the key space is 
divided into fixed-size chunks, like in a radix tree, but different from 
a B-tree.

Even if we used AVL-trees or RB-trees or something else for the top 
layers of the tree, I think at the bottom level, we'd still need to use 
sorted arrays or bitmaps, to get the density we want. So I think the 
implementation at the leaf level would look pretty much the same, in any 
case. And the upper levels don't take very much space, regardless of the 
implementation. So I don't think it matters much.

- Heikki

Re: Sparse bit set data structure

От

Julien Rouhaud

Дата:

14 марта 2019 г., 18:37:16

On Wed, Mar 13, 2019 at 8:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I started to consider rewriting the data structure into something more
> like B-tree. Then I remembered that I wrote a data structure pretty much
> like that last year already! We discussed that on the "Vacuum: allow
> usage of more than 1GB of work mem" thread [2], to replace the current
> huge array that holds the dead TIDs during vacuum.
>
> So I dusted off that patch, and made it more general, so that it can be
> used to store arbitrary 64-bit integers, rather than ItemPointers or
> BlockNumbers. I then added a rudimentary form of compression to the leaf
> pages, so that clusters of nearby values can be stored as an array of
> 32-bit integers, or as a bitmap. That would perhaps be overkill, if it
> was just to conserve some memory in GiST vacuum, but I think this will
> turn out to be a useful general-purpose facility.

I had a quick look at it, so I thought first comments could be helpful.

+ * If you change this, you must recalculate MAX_INTERVAL_LEVELS, too!
+ *   MAX_INTERNAL_ITEMS ^ MAX_INTERNAL_LEVELS >= 2^64.

I think that MAX_INTERVAL_LEVELS was a typo for MAX_INTERNAL_LEVELS,
which has probably been renamed to MAX_TREE_LEVELS in this patch.

+ * with varying levels of "compression".  Which one is used depending on the
+ * values stored.

depends on?

+       if (newitem <= sbs->last_item)
+           elog(ERROR, "cannot insert to sparse bitset out of order");

Is there any reason to disallow inserting duplicates?  AFAICT nothing
prevents that in the current code.  If that's intended, that probably
should be documented.

Nothing struck me other than that, that's a pretty nice new lib :)

Re: Sparse bit set data structure

От

Julien Rouhaud

Дата:

15 марта 2019 г., 09:30:10

On Thu, Mar 14, 2019 at 4:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> +       if (newitem <= sbs->last_item)
> +           elog(ERROR, "cannot insert to sparse bitset out of order");
>
> Is there any reason to disallow inserting duplicates?  AFAICT nothing
> prevents that in the current code.  If that's intended, that probably
> should be documented.

That of course won't work well with SBS_LEAF_BITMAP.  I'd still prefer
a more explicit error message.

Re: Sparse bit set data structure

От

Heikki Linnakangas

Дата:

20 марта 2019 г., 04:10:51

On 14/03/2019 17:37, Julien Rouhaud wrote:
> On Wed, Mar 13, 2019 at 8:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> I started to consider rewriting the data structure into something more
>> like B-tree. Then I remembered that I wrote a data structure pretty much
>> like that last year already! We discussed that on the "Vacuum: allow
>> usage of more than 1GB of work mem" thread [2], to replace the current
>> huge array that holds the dead TIDs during vacuum.
>>
>> So I dusted off that patch, and made it more general, so that it can be
>> used to store arbitrary 64-bit integers, rather than ItemPointers or
>> BlockNumbers. I then added a rudimentary form of compression to the leaf
>> pages, so that clusters of nearby values can be stored as an array of
>> 32-bit integers, or as a bitmap. That would perhaps be overkill, if it
>> was just to conserve some memory in GiST vacuum, but I think this will
>> turn out to be a useful general-purpose facility.
> 
> I had a quick look at it, so I thought first comments could be helpful.

Thanks!

> +       if (newitem <= sbs->last_item)
> +           elog(ERROR, "cannot insert to sparse bitset out of order");
> 
> Is there any reason to disallow inserting duplicates?  AFAICT nothing
> prevents that in the current code.  If that's intended, that probably
> should be documented.

Yeah, we could easily allow setting the last item again. It would be a 
no-op, though, which doesn't seem very useful. It would be useful to 
lift the limitation that the values have to be added in ascending order, 
but current users that we're thinking of don't need it. Let's add that 
later, if the need arises.

Or did you mean that the structure would be a "bag" rather than a "set", 
so that it would keep the duplicates? I don't think that would be good. 
I guess the vacuum code that this will be used in wouldn't care either 
way, but "set" seems like a more clean concept.

On 13/03/2019 21:18, I wrote:
> I'll do some more performance testing on this, to make sure it performs
> well enough on random lookups, to also replace VACUUM's dead item
> pointer array.

Turns out, it didn't perform very well for that use case. I tested with 
distributions where you have clusters of 1-200 integers, at 2^16 
intervals. That's close to the distribution of ItemPointers in a VACUUM, 
where you have 1-200 (dead) items per page, and the offset number is 
stored in the low 16 bits.  It used slightly less memory than the plain 
array of ItemPointers that we use today, but the code to use a bitmap at 
the leaf level hardly ever kicks in, because there just isn't ever 
enough set bits for that to win. In order to get the dense packing, it 
needs to be done at a much more fine-grained fashion.

So I rewrote the way the leaf nodes work, so that the leaf nodes no 
longer use a bitmap, but a simple array of items, like on internal 
nodes. To still get the dense packing, the leaf items are packed using 
an algorithm called Simple-8b, which can encode between 1-240 integers 
in a single 64-bit word, depending on how far the integers are from each 
other. That works much better, and actually makes the code simpler, too.

I renamed this thing to IntegerSet. That seems like a more accurate name 
than the "sparse bitset" that I used call it. There aren't any "bits" 
visible in the public interface of this, after all.

I improved the regression tests, so that it tests all the interface 
functions, and covers various corner-cases. It tests the set with 
different patterns of integers, and it can print the memory usage and 
execution times of adding values to the set, probing random values, and 
iterating through the set. That is a useful micro-benchmark. The speed 
of all the operations seem to be in the same ballpark as with a simple 
sorted array, but it uses much less memory.

I'm now pretty satisfied with this. Barring objections, I'll commit this 
in the next few days. Please review, if you have a chance.

- Heikki

Вложения

0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patch

Re: Sparse bit set data structure

От

Andrey Borodin

Дата:

20 марта 2019 г., 05:48:00

Hi!

Great job!

> 20 марта 2019 г., в 9:10, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
>  Please review, if you have a chance.
>
> - Heikki
> <0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patch>

I'm looking into the code and have few questions:
1. I'm not sure it is the best interface for iteration
uint64
intset_iterate_next(IntegerSet *intset, bool *found)

we will use it like

while
{
    bool found;
    BlockNumber x = (BlockNumber) intset_iterate_next(is, &found);
    if (!found)
        break;
    // do stuff
}

we could use it like

BlockNumber x;
while(intset_iterate_next(is, &x))
{
    // do stuff
}

But that's not a big difference.


2.
 * Limitations
 * -----------
 *
 * - Values must be added in order.  (Random insertions would require
 *   splitting nodes, which hasn't been implemented.)
 *
 * - Values cannot be added while iteration is in progress.

You check for violation of these limitation in code, but there is not tests for this checks.
Should we add these tests?

Best regards, Andrey Borodin.

Re: Sparse bit set data structure

От

Julien Rouhaud

Дата:

20 марта 2019 г., 19:20:11

On Wed, Mar 20, 2019 at 2:10 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 14/03/2019 17:37, Julien Rouhaud wrote:
>
> > +       if (newitem <= sbs->last_item)
> > +           elog(ERROR, "cannot insert to sparse bitset out of order");
> >
> > Is there any reason to disallow inserting duplicates?  AFAICT nothing
> > prevents that in the current code.  If that's intended, that probably
> > should be documented.
>
> Yeah, we could easily allow setting the last item again. It would be a
> no-op, though, which doesn't seem very useful. It would be useful to
> lift the limitation that the values have to be added in ascending order,
> but current users that we're thinking of don't need it. Let's add that
> later, if the need arises.
>
> Or did you mean that the structure would be a "bag" rather than a "set",
> so that it would keep the duplicates? I don't think that would be good.
> I guess the vacuum code that this will be used in wouldn't care either
> way, but "set" seems like a more clean concept.

Yes, I was thinking about "bag".  For a set, allowing inserting
duplicates is indeed a no-op and should be pretty cheap with almost no
extra code for that.  Maybe VACUUM can't have duplicate, but is it
that unlikely that other would need it?  I'm wondering if just
requiring to merge multiple such structure isn't going to be needed
soon for instance.  If that's not wanted, I'm still thinking that a
less ambiguous error should be raised.

> I'm now pretty satisfied with this. Barring objections, I'll commit this
> in the next few days. Please review, if you have a chance.

You're defining SIMPLE8B_MAX_VALUE but never use it.  Maybe you wanted
to add an assert / explicit test in intset_add_member()?

/*
 * We buffer insertions in a simple array, before packing and inserting them
 * into the B-tree.  MAX_BUFFERED_VALUES sets the size of the buffer.  The
 * encoder assumes that it is large enough, that we can always fill a leaf
 * item with buffered new items.  In other words, MAX_BUFFERED_VALUES must be
 * larger than MAX_VALUES_PER_LEAF_ITEM.
 */
#define MAX_BUFFERED_VALUES            (MAX_VALUES_PER_LEAF_ITEM * 2)

The *2 is not immediately obvious here (at least it wasn't to me),
maybe explaining intset_flush_buffered_values() main loop rationale
here could be worthwhile.

Otherwise, everything looks just fine!

Re: Sparse bit set data structure

От

Julien Rouhaud

Дата:

20 марта 2019 г., 20:50:04

On Wed, Mar 20, 2019 at 5:20 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Wed, Mar 20, 2019 at 2:10 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> > I'm now pretty satisfied with this. Barring objections, I'll commit this
> > in the next few days. Please review, if you have a chance.
>
> You're defining SIMPLE8B_MAX_VALUE but never use it.  Maybe you wanted
> to add an assert / explicit test in intset_add_member()?
>
> /*
>  * We buffer insertions in a simple array, before packing and inserting them
>  * into the B-tree.  MAX_BUFFERED_VALUES sets the size of the buffer.  The
>  * encoder assumes that it is large enough, that we can always fill a leaf
>  * item with buffered new items.  In other words, MAX_BUFFERED_VALUES must be
>  * larger than MAX_VALUES_PER_LEAF_ITEM.
>  */
> #define MAX_BUFFERED_VALUES            (MAX_VALUES_PER_LEAF_ITEM * 2)
>
> The *2 is not immediately obvious here (at least it wasn't to me),
> maybe explaining intset_flush_buffered_values() main loop rationale
> here could be worthwhile.
>
> Otherwise, everything looks just fine!

I forgot to mention a minor gripe about the intset_binsrch_uint64 /
intset_binsrch_leaf function, which are 99% duplicates.  But I don't
know if fixing that (something like passing the array as a void * and
passing a getter function?) is worth the trouble.

Re: Sparse bit set data structure

От

Heikki Linnakangas

Дата:

22 марта 2019 г., 14:35:16

Committed, thanks for the review!

I made one last-minute change: Instead of allocating a memory context in 
intset_create(), it is left to the caller. The GiST vacuum code created 
two integer sets, and it made more sense for it to use the same context 
for both. Also, the VACUUM tid list patch would like to use a memory 
context with very large defaults, so that malloc() will decide to 
mmap() it, so that the memory can be returned to the OS. Because the 
desired memory allocation varies between callers, so it's best to leave 
it to the caller.

On 20/03/2019 19:50, Julien Rouhaud wrote:
> I forgot to mention a minor gripe about the intset_binsrch_uint64 /
> intset_binsrch_leaf function, which are 99% duplicates.  But I don't
> know if fixing that (something like passing the array as a void * and
> passing a getter function?) is worth the trouble.

Yeah. I felt that it's simpler to have two almost-identical functions, 
even though it's a bit repetitive, than try to have one function serve 
both uses. The 'nextkey' argument is always passed as false to 
intset_binsrch_leaf() function, but I kept it anyway, to keep the 
functions identical. The compiler will surely optimize it away, so it 
makes no difference to performance.

On 20/03/2019 04:48, Andrey Borodin wrote:
> 1. I'm not sure it is the best interface for iteration
>
> uint64
> intset_iterate_next(IntegerSet *intset, bool *found)
> 
> we will use it like
> 
> while
> {
>      bool found;
>      BlockNumber x = (BlockNumber) intset_iterate_next(is, &found);
>      if (!found)
>          break;
>      // do stuff
> }
> 
> we could use it like
> 
> BlockNumber x;
> while(intset_iterate_next(is, &x))
> {
>      // do stuff
> }
> 
> But that's not a big difference.

Agreed, I changed it that way.

- Heikki

Re: Sparse bit set data structure

От

Adrien NAYRAT

Дата:

28 марта 2019 г., 17:46:03

Hello,

According to the draw and simple8b_mode struct comment, it seems there 
is a typo:

> *      20-bit integer       20-bit integer       20-bit integer
>  * 1101 00000000000000010010 01111010000100100000 00000000000000010100
>  * ^
>  * selector
>  *
>  * The selector 1101 is 13 in decimal.  From the modes table below, we see
>  * that it means that the codeword encodes three 12-bit integers.  In decimal,
>  * those integers are 18, 500000 and 20.  Because we encode deltas rather than
>  * absolute values, the actual values that they represent are 18, 500018 and
>  * 500038.
[...]
>     {20, 3},                    /* mode 13: three 20-bit integers */


The comment should be "the codeword encodes three *20-bit* integers" ?

Patch attached.

Regards,

Вложения

fix-typo.patch

Re: Sparse bit set data structure

От

Bruce Momjian

Дата:

09 апреля 2019 г., 02:38:11

Uh, should this be applied?

---------------------------------------------------------------------------

On Thu, Mar 28, 2019 at 03:46:03PM +0100, Adrien NAYRAT wrote:
> Hello,
> 
> According to the draw and simple8b_mode struct comment, it seems there is a
> typo:
> 
> > *      20-bit integer       20-bit integer       20-bit integer
> >  * 1101 00000000000000010010 01111010000100100000 00000000000000010100
> >  * ^
> >  * selector
> >  *
> >  * The selector 1101 is 13 in decimal.  From the modes table below, we see
> >  * that it means that the codeword encodes three 12-bit integers.  In decimal,
> >  * those integers are 18, 500000 and 20.  Because we encode deltas rather than
> >  * absolute values, the actual values that they represent are 18, 500018 and
> >  * 500038.
> [...]
> >     {20, 3},                    /* mode 13: three 20-bit integers */
> 
> 
> The comment should be "the codeword encodes three *20-bit* integers" ?
> 
> Patch attached.
> 
> Regards,

> diff --git a/src/backend/lib/integerset.c b/src/backend/lib/integerset.c
> index 28b4a38609..9984fd55e8 100644
> --- a/src/backend/lib/integerset.c
> +++ b/src/backend/lib/integerset.c
> @@ -805,7 +805,7 @@ intset_binsrch_leaf(uint64 item, leaf_item *arr, int arr_elems, bool nextkey)
>   * selector
>   *
>   * The selector 1101 is 13 in decimal.  From the modes table below, we see
> - * that it means that the codeword encodes three 12-bit integers.  In decimal,
> + * that it means that the codeword encodes three 20-bit integers.  In decimal,
>   * those integers are 18, 500000 and 20.  Because we encode deltas rather than
>   * absolute values, the actual values that they represent are 18, 500018 and
>   * 500038.


-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: Sparse bit set data structure

От

Alvaro Herrera

Дата:

09 апреля 2019 г., 03:45:33

On 2019-Apr-08, Bruce Momjian wrote:

> Uh, should this be applied?

Yes, it's a pretty obvious typo methinks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Sparse bit set data structure

От

Heikki Linnakangas

Дата:

09 апреля 2019 г., 08:35:19

On 09/04/2019 03:45, Alvaro Herrera wrote:
> On 2019-Apr-08, Bruce Momjian wrote:
> 
>> Uh, should this be applied?
> 
> Yes, it's a pretty obvious typo methinks.

Pushed, thanks, and sorry for the delay.

- Heikki

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Sparse bit set data structure

Вложения

Вложения

Вложения