Even if we use varbyte encoding, I wonder if it would be better to treat block + offset number as a single 48-bit integer, rather than encode them separately. That would allow the delta of two items on the same page to be stored as a single byte, rather than two bytes. Naturally it would be a loss on other values, but would be nice to see some kind of an analysis on that. I suspect it might make the code simpler, too.
Yeah, I had that idea, but I thought it wasn't the better option. I'll try to do some analysis.
The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions.
Attached is a version that does that. Plus some other minor cleanup.
(we should still investigate using a completely different algorithm, though)
Yes, when I thought about that, I didn't realize that we could reserve fewer than 16 bits for the offset number.
I reran my benchmark and got the following results:
event | period
-----------------------+-----------------
index_build | 00:01:46.39056
index_build_recovery | 00:00:05
index_update | 00:06:01.557708
index_update_recovery | 00:01:23
search_new | 00:24:05.600366
search_updated | 00:25:29.520642
(6 rows)
label | blocks_mark
----------------+-------------
search_new | 847509920
search_updated | 883789826
(2 rows)
label | size
---------------+-----------
new | 364560384
after_updates | 642736128
(2 rows)
Speed is the same while the index size is smaller. With the previous format it was:
label | size
---------------+-----------
new | 419299328
after_updates | 715915264
(2 rows)
Good optimization, thanks. I'll try other datasets, but I expect similar results.
However, the patch didn't apply to HEAD. A corrected version is attached.