Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Subject: Re: GIN improvements part 1: additional information
Msg-id: CAPpHfdtRAxaq8mShtpd4mh5R0=5hP900mBJNU5TnyaeM44EEyA@mail.gmail.com
In reply to: Re: GIN improvements part 1: additional information (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: GIN improvements part 1: additional information
List: pgsql-hackers
On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 12/09/2013 11:34 AM, Alexander Korotkov wrote:
On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Even if we use varbyte encoding, I wonder if it would be better to treat
block + offset number as a single 48-bit integer, rather than encode them
separately. That would allow the delta of two items on the same page to be
stored as a single byte, rather than two bytes. Naturally it would be a
loss on other values, but would be nice to see some kind of an analysis on
that. I suspect it might make the code simpler, too.
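
To make the trade-off above concrete, here is a minimal sketch of both layouts. It is not the code from the patch: the function names, the choice to store the offset as-is in the "separate" layout, and the 11-bit offset width are assumptions made for illustration only (the thread itself only says that fewer than 16 bits can be reserved for the offset number).

```python
def varbyte_encode(value: int) -> bytes:
    """Encode a non-negative integer 7 bits per byte, low-order bits first.
    The high bit of each byte is set when more bytes follow."""
    if value < 0:
        raise ValueError("varbyte encoding needs a non-negative integer")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_separate(items):
    """Assumed 'separate' layout: varbyte(block delta) then varbyte(offset).
    Every item therefore costs at least two bytes."""
    out = bytearray()
    prev_block = 0
    for block, offset in items:
        out += varbyte_encode(block - prev_block)
        out += varbyte_encode(offset)
        prev_block = block
    return bytes(out)


def encode_combined(items, offset_bits):
    """Combined layout: pack (block, offset) into one integer and
    varbyte-encode the delta from the previous packed value."""
    out = bytearray()
    prev = 0
    for block, offset in items:
        value = (block << offset_bits) | offset
        out += varbyte_encode(value - prev)
        prev = value
    return bytes(out)


def decode_combined(data, offset_bits):
    """Inverse of encode_combined, used here to check the round trip."""
    items = []
    prev = value = shift = 0
    mask = (1 << offset_bits) - 1
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: more bytes follow
            shift += 7
        else:                    # last byte of this delta
            prev += value
            items.append((prev >> offset_bits, prev & mask))
            value = shift = 0
    return items


# Items must be sorted by (block, offset); three on page 10, one on page 11.
items = [(10, 1), (10, 2), (10, 5), (11, 1)]
print(len(encode_separate(items)))        # separate layout
print(len(encode_combined(items, 16)))    # combined, 16-bit offset
print(len(encode_combined(items, 11)))    # combined, assumed 11-bit offset
```

On this tiny input, same-page deltas shrink from two bytes to one in the combined layout. With a 16-bit offset the jump to the next page costs three bytes, which is the "loss on other values" mentioned above; with a narrower offset field that jump fits in two bytes.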

Yeah, I had that idea, but I thought it wasn't a better option. Will try to
do some analysis.

The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions.

Attached is a version that does that. Plus some other minor cleanup.

(we should still investigate using a completely different algorithm, though)

Yes, when I thought about that, I didn't realize that we can reserve fewer than 16 bits for the offset number.
I reran my benchmark and got the following results:

         event         |     period      
-----------------------+-----------------
 index_build           | 00:01:46.39056
 index_build_recovery  | 00:00:05
 index_update          | 00:06:01.557708
 index_update_recovery | 00:01:23
 search_new            | 00:24:05.600366
 search_updated        | 00:25:29.520642
(6 rows)

     label      | blocks_mark 
----------------+-------------
 search_new     |   847509920
 search_updated |   883789826
(2 rows)

     label     |   size    
---------------+-----------
 new           | 364560384
 after_updates | 642736128
(2 rows)

Speed is the same, while the index is smaller. With the previous format the sizes were:

     label     |   size    
---------------+-----------
 new           | 419299328
 after_updates | 715915264
(2 rows)
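
For reference, the relative saving can be worked out directly from the two size tables above. This is only arithmetic on the reported numbers; the helper name is made up for the example.

```python
# Index sizes in bytes, copied from the tables above: (previous format, new format).
sizes = {
    "new": (419299328, 364560384),
    "after_updates": (715915264, 642736128),
}


def reduction_percent(old_size: int, new_size: int) -> float:
    """Percentage by which new_size is smaller than old_size."""
    return 100.0 * (old_size - new_size) / old_size


for label, (old_size, new_size) in sizes.items():
    print(f"{label}: {reduction_percent(old_size, new_size):.1f}% smaller")
```

That works out to roughly 13% for the freshly built index and roughly 10% after updates.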

Good optimization, thanks. I'll try other datasets, but I expect similar results.
However, the patch didn't apply to HEAD. A corrected version is attached.

------
With best regards,
Alexander Korotkov.