Re: gistchoose vs. bloat

Поиск

Список

Период

Сортировка

От	Alexander Korotkov
Тема	Re: gistchoose vs. bloat
Дата	24 января 2013 г. 20:09:05
Msg-id	CAPpHfduF7taDg=eNgAVQtU6bGrRRnWDN1Z2YOENtmPTS1Do91w@mail.gmail.com обсуждение исходный текст
Ответ на	Re: gistchoose vs. bloat (Heikki Linnakangas <hlinnakangas@vmware.com>)
Ответы	Re: gistchoose vs. bloat
Список	pgsql-hackers

Дерево обсуждения

On Thu, Jan 24, 2013 at 11:26 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 21.01.2013 15:19, Heikki Linnakangas wrote:
On 21.01.2013 15:06, Tom Lane wrote:
Jeff Davis<pgsql@j-davis.com> writes:
On Mon, 2013-01-21 at 00:48 -0500, Tom Lane wrote:
I looked at this patch. ISTM we should not have the option at all but
just do it always. I cannot believe that always-go-left is ever a
preferable strategy in the long run; the resulting imbalance in the
index will surely kill any possible benefit. Even if there are some
cases where it miraculously fails to lose, how many users are going to
realize that applies to their case and make use of the option?

Sounds good to me.

If I remember correctly, there was also an argument that it may be
useful for repeatable test results. That's a little questionable for
performance (except in those cases where few penalties are identical
anyway), but could plausibly be useful for a crash report or something.

Meh. There's already a random decision, in the equivalent place and for
a comparable reason, in btree (cf _bt_findinsertloc). Nobody's ever
complained about that being indeterminate, so I'm unconvinced that
there's a market for it with gist.

I wonder if it would work for gist to do something similar to
_bt_findinsertloc, and have a bias towards the left page, but sometimes
descend to one of the pages to the right. You would get the cache
locality of usually descending down the same subtree, but also fill the
pages to the right. Jeff / Alexander, want to give that a shot?

I did some experimenting with that. I used the same test case Alexander did, with geonames data, and compared unpatched version, the original patch, and the attached patch that biases the first "best" tuple found, but still sometimes chooses the other equally good ones.

testname | initsize | finalsize | idx_blks_read | idx_blks_hit
----------------+----------+-----------+---------------+--------------
patched-10-4mb | 75497472 | 90202112 | 5853604 | 10178331
unpatched-4mb | 75145216 | 94863360 | 5880676 | 10185647
unpatched-4mb | 75587584 | 97165312 | 5903107 | 10183759
patched-2-4mb | 74768384 | 81403904 | 5768124 | 10193738
origpatch-4mb | 74883072 | 82182144 | 5783412 | 10185373

All these tests were performed with shared_buffers=4MB, so that the index won't fit completely in shared buffers, and I could use the idx_blk_read/hit ratio as a measure of cache-friendliness. The "origpath" test was with a simplified version of Alexander's patch, see attached. It's the same as the original, but with the randomization-option and gist-build related code removed. The patched-10 and patched-2 tests are two variants with my patch, with 1/10 and 1/2 chance of moving to the next equal tuple, respectively. The differences in cache hit ratio might be just a result of different index sizes. I included two unpatched runs above to show that there's a fair amount of noise in these tests. That's because of the random nature of the test case; it picks rows to delete and insert at random.

I think the conclusion is that all of these patches are effective. The 1/10 variant is less effective, as expected, as it's closer in behavior to the unpatched behavior than the others. The 1/2 variant seems as good as the original patch.

Does two unpatched-4mb lines are different by coincidence? If so then variance is significant and we need more experiments to actually compare patched-2-4mb and origpatch-4mb lines.

There is another cause of overhead when use randomization in gistchoose: extra penalty calls. It could be significant when index fits to cache. In order evade it I especially change behaviour of my patch from "look sequentially and choose random" to "look in random order". I think we need to include comparison of CPU time.

------

With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: gistchoose vs. bloat