Re: Parallel CREATE INDEX for GIN indexes
От | Tomas Vondra |
---|---|
Тема | Re: Parallel CREATE INDEX for GIN indexes |
Дата | |
Msg-id | 7421a646-d938-4558-b001-85d217eb43a5@vondra.me обсуждение исходный текст |
Ответ на | Re: Parallel CREATE INDEX for GIN indexes (Tomas Vondra <tomas@vondra.me>) |
Ответы |
Re: Parallel CREATE INDEX for GIN indexes
|
Список | pgsql-hackers |
While working on the progress reporting, I've been looking into the performance results, particularly why the parallelism doesn't help much for some indexes - e.g. the index on the headers JSONB column. CREATE INDEX headers_jsonb_idx ON messages USING gin (msg_headers); In this case the parallelism helps only a little bit - serial build takes ~47 seconds, parallel builds with 1 worker (so 2 with leader) takes ~40 seconds. Not great. There are two reasons for this. First, the "keys" (JSONB values) are mostly unique, with only 1 or 2 TIDs per key, which means the workers can't really do much merging. But shifting the merges to workers is the main benefit of parallel builds - if the merge happens in the leader anyway, this explains the lack of speedup. The other reason is that with JSON keys the comparisons are rather expensive, and we're comparing a lot of keys. It occurred to me we can work around this by comparing hashes first, and comparing the full keys only when the hashes match. And indeed, this helps a lot (there's a very rough PoC patch attached) - I'm seeing ~20% speedup from this, so the parallel build runs in ~30 seconds now. Still not quite serial speedup, but better than before. But I think this optimization is mostly orthogonal to parallel builds, i.e. we could do the same thing for serial builds (while accumulating data in memory, we could do these comparisons). But it needs to be careful about still writing the data out in the "natural" order, not ordered by hash. The hash randomizes the pattern, making it much less efficient for bulk inserts (it trashes the buffers, etc.). The PoC patch for parallel builds addresses this by ignoring the hash during the final tuplesort, the serial builds would need to do something similar. My conclusion is this can be left as a future improvement, independent of the parallel builds. regards -- Tomas Vondra
Вложения
В списке pgsql-hackers по дате отправления: