Re: Bug in new buffering GiST build code

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: Bug in new buffering GiST build code
Дата
Msg-id CAPpHfdvwEatx1+8ksdQey7HO=7eNxZ_TaQMJj4xaUB21p4jVWA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Bug in new buffering GiST build code  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Ответы Re: Bug in new buffering GiST build code  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Re: Bug in new buffering GiST build code  (Alexander Korotkov <aekorotkov@gmail.com>)
Список pgsql-hackers
On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
I think we should rewrite the way we track the parents completely. Rather than keep a path stack attached to every node buffer, let's just maintain a second hash table that contains the parent of every internal node. Whenever a downlink is moved to another page, update the hash table with the new information. That way we always have up-to-date information about the parent of every internal node.

That's much easier to understand than the path stack structures we have now. I think the overall memory consumption will be about the same too. Although we need the extra hash table with one entry for every internal node, we get rid of the path stack structs, which are also one per every internal node at the moment.

I believe it is faster too. I added some instrumentation to the existing gist code (with the additional getNodeBuffer() call added to fix this bug), to measure the time spent moving right, when refinding the parent of a page. I added gettimeofday() calls before and after moving right, and summed the total. In my test case, the final index size was about 19GB, and the index build took 3545 seconds (59 minutes). Of that time, 580 seconds (~ 10 minutes) was spent moving right to refind parents. That's a lot. I also printed a line whenever a refind operation had to move right 20 pages or more. That happened 2482 times during the build, in the worst case we moved right over 40000 pages.

Attached is a patch to replace the path stacks with a hash table. With this patch, the index build time in my test case dropped from 59 minutes to about 55 minutes. I don'ẗ know how representative or repeatable this test case is, but this definitely seems very worthwhile, not only because it fixes the bug and makes the code simpler, but also on performance grounds.

Cool, seems that we've both simplier and faster implementation of finding parent. Thanks!
 
Alexander, do you still have the test environments and data lying around that you used for GiST buffering testing last summer? Could you rerun some of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

------
With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Euler Taveira
Дата:
Сообщение: Re: No, pg_size_pretty(numeric) was not such a hot idea
Следующее
От: Noah Misch
Дата:
Сообщение: Re: Synchronized scans versus relcache reinitialization