gistchoose vs. bloat

From: Alexander Korotkov
Subject: gistchoose vs. bloat
Date:
Msg-id: CAPpHfduhRjJ3WQ3F2yac+UZ98XFoGW4iXfontLZYkBC+dqpFFQ@mail.gmail.com
Responses: Re: gistchoose vs. bloat  (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers
Hackers,

While experimenting with gistchoose I got some interesting results about the relation between gistchoose behaviour and GiST index bloat.
I've created the following test case for reproducing GiST index bloat:
1) Create test table with 1000000 points from geonames
# create table geotest (id serial, point point);
# insert into geotest (point) (select * from geonames order by random() limit 1000000);
2) Create a GiST index and a statistics table. Insert the initial statistics (row count, index size) into that table.
# create index geotest_idx on geotest using gist (point) with (buffering = off);
# create table geotest_stats (id serial, count bigint, size bigint);
# insert into geotest_stats (count, size) values ((select count(*) from geotest), pg_relation_size('geotest_idx'));
3) Delete 10% of geotest and vacuum it.
delete from geotest where id in (select id from geotest order by random() limit 100000);
vacuum geotest;
4) Insert 100000 new points from geonames.
insert into geotest (point) (select * from geonames order by random() limit 100000);
5) Insert the current statistics (row count, index size) into the statistics table.
insert into geotest_stats (count, size) values ((select count(*) from geotest), pg_relation_size('geotest_idx'));
6) Repeat steps 3-5 fifty times (a hypothetical driver for this loop is sketched below).
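
For completeness, here is a hypothetical driver for the loop in steps 3-5 using libpq; the connection string ("dbname=test") and the file name are assumptions, and the queries are exactly the ones from the steps above.

/*
 * Hypothetical driver for steps 3-5, repeated 50 times.
 * Build with:  cc gist_bloat_test.c -lpq -o gist_bloat_test
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
	PGresult   *res = PQexec(conn, sql);

	if (PQresultStatus(res) != PGRES_COMMAND_OK &&
		PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
		PQclear(res);
		PQfinish(conn);
		exit(1);
	}
	PQclear(res);
}

int
main(void)
{
	/* connection string is an assumption; adjust to your setup */
	PGconn	   *conn = PQconnectdb("dbname=test");
	int			i;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	for (i = 0; i < 50; i++)
	{
		/* step 3: delete 10% of the rows and vacuum */
		run(conn, "delete from geotest where id in "
				  "(select id from geotest order by random() limit 100000)");
		run(conn, "vacuum geotest");

		/* step 4: insert 100000 new points from geonames */
		run(conn, "insert into geotest (point) "
				  "(select * from geonames order by random() limit 100000)");

		/* step 5: record the current row count and index size */
		run(conn, "insert into geotest_stats (count, size) values "
				  "((select count(*) from geotest), "
				  "pg_relation_size('geotest_idx'))");
	}

	PQfinish(conn);
	return 0;
}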

I got the following results with the current gistchoose implementation:
test=# select * from geotest1_stats;
 id |  count  |   size    
----+---------+-----------
  1 | 1000000 |  75767808
  2 | 1000000 |  76046336
  3 | 1000000 |  76840960
  4 | 1000000 |  77799424
  5 | 1000000 |  78766080
  6 | 1000000 |  79757312
  7 | 1000000 |  80494592
  8 | 1000000 |  81125376
  9 | 1000000 |  81985536
 10 | 1000000 |  82804736
 11 | 1000000 |  83378176
 12 | 1000000 |  84115456
 13 | 1000000 |  84819968
 14 | 1000000 |  85598208
 15 | 1000000 |  86302720
 16 | 1000000 |  87023616
 17 | 1000000 |  87703552
 18 | 1000000 |  88342528
 19 | 1000000 |  88915968
 20 | 1000000 |  89513984
 21 | 1000000 |  90152960
 22 | 1000000 |  90742784
 23 | 1000000 |  91324416
 24 | 1000000 |  91742208
 25 | 1000000 |  92258304
 26 | 1000000 |  92758016
 27 | 1000000 |  93241344
 28 | 1000000 |  93683712
 29 | 1000000 |  93970432
 30 | 1000000 |  94396416
 31 | 1000000 |  94740480
 32 | 1000000 |  95068160
 33 | 1000000 |  95444992
 34 | 1000000 |  95780864
 35 | 1000000 |  96313344
 36 | 1000000 |  96731136
 37 | 1000000 |  96968704
 38 | 1000000 |  97222656
 39 | 1000000 |  97509376
 40 | 1000000 |  97779712
 41 | 1000000 |  98074624
 42 | 1000000 |  98344960
 43 | 1000000 |  98639872
 44 | 1000000 |  99000320
 45 | 1000000 |  99229696
 46 | 1000000 |  99459072
 47 | 1000000 |  99672064
 48 | 1000000 |  99901440
 49 | 1000000 | 100114432
 50 | 1000000 | 100261888
 51 | 1000000 | 100458496
(51 rows)

The index grew from about 75 MB to 100 MB.

The current implementation of gistchoose selects the first index tuple that has the minimal penalty. It is possible for several tuples to have the same minimal penalty, so I've created a simple patch which selects a random tuple among them (a standalone sketch of the idea is included below, after the results). With this patch I got the following results for the same test case.

test=# select * from geotest_stats;
 id |  count  |   size   
----+---------+----------
  1 | 1000000 | 74694656
  2 | 1000000 | 74743808
  3 | 1000000 | 74891264
  4 | 1000000 | 75120640
  5 | 1000000 | 75374592
  6 | 1000000 | 75612160
  7 | 1000000 | 75833344
  8 | 1000000 | 76144640
  9 | 1000000 | 76333056
 10 | 1000000 | 76595200
 11 | 1000000 | 76718080
 12 | 1000000 | 76939264
 13 | 1000000 | 77070336
 14 | 1000000 | 77332480
 15 | 1000000 | 77520896
 16 | 1000000 | 77750272
 17 | 1000000 | 77996032
 18 | 1000000 | 78127104
 19 | 1000000 | 78307328
 20 | 1000000 | 78512128
 21 | 1000000 | 78610432
 22 | 1000000 | 78774272
 23 | 1000000 | 78929920
 24 | 1000000 | 79060992
 25 | 1000000 | 79216640
 26 | 1000000 | 79331328
 27 | 1000000 | 79454208
 28 | 1000000 | 79593472
 29 | 1000000 | 79708160
 30 | 1000000 | 79822848
 31 | 1000000 | 79921152
 32 | 1000000 | 80035840
 33 | 1000000 | 80076800
 34 | 1000000 | 80175104
 35 | 1000000 | 80207872
 36 | 1000000 | 80322560
 37 | 1000000 | 80363520
 38 | 1000000 | 80445440
 39 | 1000000 | 80494592
 40 | 1000000 | 80576512
 41 | 1000000 | 80666624
 42 | 1000000 | 80764928
 43 | 1000000 | 80805888
 44 | 1000000 | 80912384
 45 | 1000000 | 80994304
 46 | 1000000 | 81027072
 47 | 1000000 | 81100800
 48 | 1000000 | 81174528
 49 | 1000000 | 81297408
 50 | 1000000 | 81371136
 51 | 1000000 | 81420288
(51 rows)

Now the index grew from about 75 MB to 81 MB, which is almost 4 times less index bloat (roughly 6.7 MB of growth versus roughly 24.7 MB)!
My explanation is the following. If index tuples overlap, the probability of insertion into the last of them is much lower than the probability of insertion into the first one, because ties are always resolved in favour of the first tuple. Thereby, freed space under the later tuples of an inner page is not likely to be reused. Selecting a random tuple increases that chance.
Obviously, this makes insertion more expensive. I need some more benchmarks to measure by how much. Probably, we would need a user-visible option to choose between cheaper inserts and less bloat.
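
To illustrate the tie-breaking idea, here is a minimal standalone sketch. It is not the actual patch against gistchoose(); the penalty array, choose_random_min() and the use of rand() are illustration-only stand-ins for what gistchoose() computes per index tuple.

/*
 * Standalone sketch of randomized tie-breaking among candidates that
 * share the minimal penalty (reservoir-sampling style: the k-th tying
 * candidate replaces the current choice with probability 1/k, so each
 * tying candidate ends up chosen with equal probability).
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int
choose_random_min(const float *penalty, int ntuples)
{
	float	best = 0.0f;
	int		which = -1;
	int		nties = 0;
	int		i;

	for (i = 0; i < ntuples; i++)
	{
		if (which < 0 || penalty[i] < best)
		{
			/* strictly better penalty: restart tie counting */
			best = penalty[i];
			which = i;
			nties = 1;
		}
		else if (penalty[i] == best)
		{
			/* tie: switch to this candidate with probability 1/nties */
			nties++;
			if (rand() % nties == 0)
				which = i;
		}
	}
	return which;
}

int
main(void)
{
	/* hypothetical penalties for the tuples of one inner page */
	float	penalty[] = {3.5f, 0.0f, 1.2f, 0.0f, 0.0f, 2.7f};
	int		n = sizeof(penalty) / sizeof(penalty[0]);
	int		i;

	srand((unsigned) time(NULL));

	/*
	 * With the current behaviour the first zero-penalty tuple (offset 1)
	 * would always win; here offsets 1, 3 and 4 are each chosen about
	 * one third of the time.
	 */
	for (i = 0; i < 10; i++)
		printf("chosen tuple: %d\n", choose_random_min(penalty, n));

	return 0;
}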

------
With best regards,
Alexander Korotkov.