Re: Parallel CREATE INDEX for GIN indexes

From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date:
Msg-id: 5de12918-c1bc-4639-80c8-f06d2a8c95e3@enterprisedb.com
In reply to: Re: Parallel CREATE INDEX for GIN indexes  (Andy Fan <zhihuifan1213@163.com>)
Responses: Re: Parallel CREATE INDEX for GIN indexes
           Re: Parallel CREATE INDEX for GIN indexes
List: pgsql-hackers

On 5/9/24 11:44, Andy Fan wrote:
> 
> Hello Tomas,
> 
>>>> 2) v20240502-0002-Use-mergesort-in-the-leader-process.patch
>>>>
>>>> The approach implemented by 0001 works, but there's a little bit of
>>>> issue - if there are many distinct keys (e.g. for trigrams that can
>>>> happen very easily), the workers will hit the memory limit with only
>>>> very short TID lists for most keys. For serial build that means merging
>>>> the data into a lot of random places, and in parallel build it means the
>>>> leader will have to merge a lot of tiny lists from many sorted runs.
>>>>
>>>> Which can be quite annoying and expensive, because the leader does so
>>>> using qsort() in the serial part. It'd be better to ensure most of the
>>>> sorting happens in the workers, and the leader can do a mergesort. But
>>>> the mergesort must not happen too often - merging many small lists is
>>>> not cheaper than a single qsort (especially when the lists overlap).
>>>>
>>>> So this patch changes the workers to process the data in two phases. The
>>>> first works as before, but the data is flushed into a local tuplesort.
>>>> And then each worker sorts the results it produced, and combines them
>>>> into results with much larger TID lists, and those results are written
>>>> to the shared tuplesort. So the leader only gets very few lists to
>>>> combine for a given key - usually just one list per worker.
>>>
>>> Hmm, I was hoping we could implement the merging inside the tuplesort
>>> itself during its own flush phase, as it could save significantly on
>>> IO, and could help other users of tuplesort with deduplication, too.
>>>
>>
>> Would that happen in the worker or leader process? Because my goal was
>> to do the expensive part in the worker, because that's what helps with
>> the parallelization.
> 
> I guess both of you are talking about the worker process; here is
> what I have in mind:
> 
> *btbuild* also lets the WORKER dump the tuples into the Sharedsort
> struct and lets the LEADER merge them directly. I think the aim of that
> design is that it can potentially save a mergeruns pass. In the current
> patch, the worker dumps to a local tuplesort and mergeruns it, and then
> the leader runs the merges again. I admit the goal of this patch is
> reasonable, but I feel we need to apply this approach conditionally
> somehow; and if we find the way, we can apply it to btbuild as well.
> 

I'm a bit confused about what you're proposing here, or how it relates
to what this patch is doing and/or to what Matthias mentioned in his
e-mail from last week.

Let me explain the relevant part of the patch, and how I understand the
improvement suggested by Matthias. The patch does the work in three phases:


1) A worker gets data from the table, splits it into index items, adds
those into a "private" tuplesort, and finally sorts it. A worker may see
a key many times, with different TIDs, so the tuplesort may contain many
items for the same key, with distinct TID lists:

   key1: 1, 2, 3, 4
   key1: 5, 6, 7
   key1: 8, 9, 10
   key2: 1, 2, 3
   ...


2) The worker reads the sorted data, combines TIDs for the same key into
larger TID lists (depending on work_mem etc.), and writes the result
into a shared tuplesort. So the worker may write this:

   key1: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
   key2: 1, 2, 3


3) The leader reads this, combines TID lists from all workers (probably
using a higher memory limit), and writes the result into the index.


Step (2) is optional - the build would work without it. But it helps, as
it moves a potentially expensive sort into the workers (and thus into
the parallel part of the build), and it also makes the sort cheaper,
because within a single worker the lists do not overlap and thus can
simply be appended. That is not the case in the leader, which is forced
to do an expensive mergesort.
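
To make the append-only combine in step (2) concrete, here is a minimal
standalone sketch in C. The types and names (SortTuple, flush_combined)
are invented for illustration - this is not the patch's actual code,
which works with GIN tuples and also has to honor the memory limit:

    #include <stdio.h>
    #include <string.h>

    /*
     * One entry coming out of the step (1) tuplesort. The input is
     * assumed to be sorted by (key, first TID), so all lists for a
     * given key are adjacent and disjoint.
     */
    typedef struct
    {
        const char *key;
        int         tids[8];
        int         ntids;
    } SortTuple;

    /* stand-in for writing one combined tuple to the shared tuplesort */
    static void
    flush_combined(const char *key, const int *tids, int ntids)
    {
        printf("%s:", key);
        for (int i = 0; i < ntids; i++)
            printf(" %d", tids[i]);
        printf("\n");
    }

    int
    main(void)
    {
        /* output of step (1): sorted, with short TID lists per key */
        SortTuple   input[] = {
            {"key1", {1, 2, 3, 4}, 4},
            {"key1", {5, 6, 7}, 3},
            {"key1", {8, 9, 10}, 3},
            {"key2", {1, 2, 3}, 3},
        };
        int         buf[64];
        int         nbuf = 0;
        const char *curkey = NULL;

        for (size_t i = 0; i < sizeof(input) / sizeof(input[0]); i++)
        {
            SortTuple  *t = &input[i];

            /* key change: emit the combined list built so far */
            if (curkey != NULL && strcmp(curkey, t->key) != 0)
            {
                flush_combined(curkey, buf, nbuf);
                nbuf = 0;
            }
            curkey = t->key;

            /* disjoint sorted lists for the same key: a plain append */
            memcpy(buf + nbuf, t->tids, t->ntids * sizeof(int));
            nbuf += t->ntids;
        }
        if (curkey != NULL)
            flush_combined(curkey, buf, nbuf);

        return 0;
    }

Running this prints exactly the combined output from the example above:
key1 with TIDs 1 through 10 on one line, and key2 with TIDs 1 through 3.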

The trouble with (2) is that it "just copies" data from one tuplesort
into another, increasing the disk space requirements. In an extreme
case, when nothing can be combined, it pretty much doubles the amount of
disk space and makes the build take longer.

What I think Matthias is suggesting is that this "TID list merging"
could be done directly as part of the tuplesort in step (1). So instead
of just moving the "sort tuples" from the appropriate runs, the
tuplesort could also perform an optional step of combining equal-key
tuples and writing the combined tuple into the tuplesort result (for
that worker).
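
As I read the suggestion, the effect would be roughly that of this toy
two-way merge, where consecutive equal keys are folded into one output
tuple during the merge itself rather than in a separate pass. The data
model (Item, emit) is again invented for illustration - tuplesort has no
such combine hook today, which is exactly what would have to be added:

    #include <stdio.h>

    typedef struct { int key; int tid; } Item;

    /* stand-in for writing one combined tuple to the output run */
    static void
    emit(int key, const int *tids, int n)
    {
        printf("key%d:", key);
        for (int i = 0; i < n; i++)
            printf(" %d", tids[i]);
        printf("\n");
    }

    int
    main(void)
    {
        /* two sorted runs, as tuplesort would hold them before a merge */
        Item    a[] = {{1, 1}, {1, 3}, {2, 2}};
        Item    b[] = {{1, 2}, {2, 1}, {2, 3}};
        int     na = 3, nb = 3, i = 0, j = 0;
        int     buf[16], nbuf = 0, curkey = -1;

        while (i < na || j < nb)
        {
            Item    t;

            /* the ordinary merge step: take the smaller (key, tid) */
            if (j >= nb || (i < na && (a[i].key < b[j].key ||
                 (a[i].key == b[j].key && a[i].tid < b[j].tid))))
                t = a[i++];
            else
                t = b[j++];

            /*
             * The extra combine step: flush only on a key change, so
             * all entries for one key leave the merge as a single
             * wide tuple instead of many narrow ones.
             */
            if (curkey != -1 && t.key != curkey)
            {
                emit(curkey, buf, nbuf);
                nbuf = 0;
            }
            curkey = t.key;
            buf[nbuf++] = t.tid;
        }
        if (curkey != -1)
            emit(curkey, buf, nbuf);

        return 0;
    }

Note that here the TID lists for a key do interleave across the two runs,
and the combine still works, because it folds tuples element by element
as the merge emits them.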

Matthias also mentioned this might be useful when building btree indexes
with key deduplication.

AFAICS this might work, although it probably requires the "combined"
tuple to be no larger than the sum of the tuples it replaces (in order
to fit into the space). But at least for the GIN build in the workers
this is likely true, because the TID lists do not overlap (and thus do
not hurt compressibility).

That being said, I still see this more as an optimization than something
required for the patch, and I don't think I'll have time to work on this
anytime soon. The patch is not extremely complex, but it's not trivial
either. But if someone wants to take a stab at extending tuplesort to
allow this, I won't object ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


