Re: Parallel CREATE INDEX for GIN indexes

Поиск

Список

Период

Сортировка

От	Andy Fan
Тема	Re: Parallel CREATE INDEX for GIN indexes
Дата	2 июля 2024 г. 00:07:57
Msg-id	87o77gveoy.fsf@163.com обсуждение исходный текст
Ответ на	Re: Parallel CREATE INDEX for GIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы	Re: Parallel CREATE INDEX for GIN indexes
Список	pgsql-hackers

Дерево обсуждения

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Hi Tomas,

I am in a incompleted review process but I probably doesn't have much
time on this because of my internal tasks. So I just shared what I
did and the non-good-result patch.

What I tried to do is:
1) remove all the "sort" effort for the state->bs_sort_state tuples since
its input comes from state->bs_worker_state which is sorted already.

2). remove *partial* "sort" operations between accum.rbtree to
state->bs_worker_state because the tuple in accum.rbtree is sorted
already. 

Both of them can depend on the same API changes. 

1. 
struct Tuplesortstate
{
..
+ bool input_presorted; /*  the tuples are presorted. */
+ new_tapes;  // writes the tuples in memory into a new 'run'. 
}

and user can set it during tuplesort_begin_xx(, presorte=true);


2. in tuplesort_puttuple, if memory is full but presorted is
true, we can (a) avoid the sort. (b) resuse the existing 'runs'
to reduce the effort of 'mergeruns' unless new_tapes is set to
true. once it switch to a new tapes, the set state->new_tapes to false
and wait 3) to change it to true again.

3. tuplesort_dumptuples(..);  // dump the tuples in memory and set
new_tapes=true to tell the *this batch of input is presorted but they
are done, the next batch is just presort in its own batch*.

In the gin-parallel-build case,  for the case 1), we can just use

for tuple in bs_worker_sort: 
    tuplesort_putgintuple(state->bs_sortstate, ..);
tuplesort_dumptuples(..);

At last we can get a). only 1 run in the worker so that the leader can
have merge less runs in mergeruns.  b). reduce the sort both in
perform_sort_tuplesort and in sortstate_puttuple_common.

for the case 2). we can have:

   for tuple in RBTree.tuples:
         tuplesort_puttuples(tuple) ;  
      // this may cause a dumptuples internally when the memory is full,
      // but it is OK.
    tuplesort_dumptuples(..);

we can just remove the "sort" into sortstate_puttuple_common but
probably increase the 'runs' in sortstate which will increase the effort
of mergeruns at last.

But the test result is not good, maybe the 'sort' is not a key factor of
this. I do missed the perf step before doing this. or maybe my test data
is too small. 

Here is the patch I used for the above activity. and I used the
following sql to test. 

CREATE TABLE t(a int[], b numeric[]);

-- generate 1000 * 1000 rows.
insert into t select i, n
from normal_rand_array(1000, 90, 1::int4, 10000::int4) i,
normal_rand_array(1000, 90, '1.00233234'::numeric, '8.239241989134'::numeric) n;

alter table t set (parallel_workers=4);
set debug_parallel_query to on;
set max_parallel_maintenance_workers to 4;

create index on t using gin(a);
create index on t using gin(b);

I found normal_rand_array is handy to use in this case and I
register it into https://commitfest.postgresql.org/48/5061/.

Besides the above stuff, I didn't find anything wrong in the currrent
patch, and the above stuff can be categoried into "furture improvement"
even it is worthy to. 

-- 
Best Regards
Andy Fan

Вложения

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Parallel CREATE INDEX for GIN indexes

Вложения