Обсуждение: Reduce build times of pg_trgm GIN indexes
Hi hackers, Attached is a series of patches which gradually reduce the time it takes to create GIN indexes. Most of the gains come from optimizing the trigram extraction code in pg_trgm. A few small optimizations apply to any GIN index operator class. The changes are motivated by the time it takes to create GIN indexes on large production tables, especially, on columns with long strings. Even with multiple parallel maintenance workers I've seen this taking hours. For testing purposes I've used two different data sets: 1. The l_comment column of the TPC-H SF 10 lineitem table. l_comment contains relatively short strings with a minimum of 10, a maximum of 43 and an average of ~27 characters. 2. The plots from a collection of movies from Wikipedia. The plots are much longer than l_comment, with a minimum of 15, a maximum of 36,773 and an average of ~2,165 characters. The CSV file can be downloaded here [1]. Testing both cases is important because a big part of the trigram extraction is spent on removing duplicates. The longer the string, the more duplicates are usually encountered. The script I used for testing is attached. I ran CREATE INDEX three times and took the fastest run. I'm getting the following results on my i9-13950HX dev laptop in release build: Data set | Patched (ms) | Master (ms) | Speedup --------------------|--------------|--------------|---------- movies(plot) | 3,409 | 10,311 | 3.02x lineitem(l_comment) | 161,569 | 256,986 | 1.59x The attached patches do the following: - v1-0001-Inline-ginCompareAttEntries.patch: Inline ginCompareAttEntries() which is very frequently called by the GIN code. - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke() instead of FunctionCall2Coll(). This saves a bunch of per-comparison setup code, such as calling InitFunctionCallInfoData(). - v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of qsort(), to inline calls to the sort comparator. This is an interim step that is further improved on by patch 0006. - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch ginExtractEntries() deduplicates and sorts the entries returned from the extract value function. In case of pg_trgm, that is completely redundant because the trigrams are already deduplicated and sorted. The current version of this patch is just to demonstrate the potential. We need to think about what we want here. Ideally, we would require the extraction function to provide the entries deduplicated and sorted. Alternatively, we could indicate to ginExtractEntries() if the entries are already deduplicated and sorted. If we don't want to alter the signature of the extract value function, we could e.g. use the MSB of the nentries argument. - v1-0005-Make-btint4cmp-branchless.patch: Removes branches from btint4cmp(), which is heavily called from the GIN code. This might as well have benefit in other parts of the code base. v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort() with radix sort. For the purpose of demonstrating the possible gains, I've only replaced the signed variant for now. I've also tried using simplehash.h for deduplicating followed by a sort_template.h-based sort. But that was slower. v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same optimization can be done in plenty of other places in our code base. Likely, in most of them the gains are insignificant. v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots of text is actually ASCII. Hence, we provide a fast path for this case which is exercised if the MSB of the current character is unset. With above changes, the majority of the runtime is now spent on inserting the trigrams into the GIN index via ginInsertBAEntry(). The code in master uses a red-black for further deduplication and sorting. Traversing the red-black tree and updating it is pretty slow. I haven't looked through all the code yet, but it seems to me that we would be better off replacing the red-black tree with a sort and/or a hash map. But I'll leave this as future work for now. [1] https://github.com/kiq005/movie-recommendation/raw/refs/heads/master/src/dataset/wiki_movie_plots_deduped.csv -- David Geier
Вложения
- v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
- v1-0007-Faster-qunique-comparator.patch
- v1-0006-Use-radix-sort.patch
- v1-0005-Make-btint4cmp-branchless.patch
- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
- v1-0003-Use-sort_template.h.patch
- v1-0002-Optimized-comparison-functions.patch
- v1-0001-Inline-ginCompareAttEntries.patch
- test_gin_optimizations.sql
On 05/01/2026 17:01, David Geier wrote:
> The script I used for testing is attached. I ran CREATE INDEX three
> times and took the fastest run. I'm getting the following results on my
> i9-13950HX dev laptop in release build:
>
> Data set | Patched (ms) | Master (ms) | Speedup
> --------------------|--------------|--------------|----------
> movies(plot) | 3,409 | 10,311 | 3.02x
> lineitem(l_comment) | 161,569 | 256,986 | 1.59x
>
Impressive speedup!
> The attached patches do the following:
>
> - v1-0001-Inline-ginCompareAttEntries.patch: Inline
> ginCompareAttEntries() which is very frequently called by the GIN code.
Looks good.
> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
> setup code, such as calling InitFunctionCallInfoData().
You lose the check for NULL result with this. That's probably still
worth checking.
> - v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
> qsort(), to inline calls to the sort comparator. This is an interim step
> that is further improved on by patch 0006.
ok
> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
> ginExtractEntries() deduplicates and sorts the entries returned from the
> extract value function. In case of pg_trgm, that is completely redundant
> because the trigrams are already deduplicated and sorted. The current
> version of this patch is just to demonstrate the potential. We need to
> think about what we want here. Ideally, we would require the extraction
> function to provide the entries deduplicated and sorted. Alternatively,
> we could indicate to ginExtractEntries() if the entries are already
> deduplicated and sorted. If we don't want to alter the signature of the
> extract value function, we could e.g. use the MSB of the nentries argument.
Yeah, this seems wrong as it is. You're assuming that if the extract
function returns nullFlags == NULL, the array is already sorted and deduped.
> - v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
> btint4cmp(), which is heavily called from the GIN code. This might as
> well have benefit in other parts of the code base.
Seems reasonable.
> v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
> with radix sort. For the purpose of demonstrating the possible gains,
> I've only replaced the signed variant for now. I've also tried using
> simplehash.h for deduplicating followed by a sort_template.h-based sort.
> But that was slower.
Ok.
> v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
> full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
> equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
> optimization can be done in plenty of other places in our code base.
> Likely, in most of them the gains are insignificant.
Makes sense. I'm a little disappointed the compiler won't do that
optimization for us..
Perhaps we should introduce a new qunique_eq() function with a different
callback signature:
/* like qunique(), but the callback function returns true/false rather
than int */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
bool (*equal) (const void *, const void *))
> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
> of text is actually ASCII. Hence, we provide a fast path for this case
> which is exercised if the MSB of the current character is unset.
This uses pg_ascii_tolower() when for ASCII characters when built with
the IGNORECASE. I don't think that's correct, if the proper collation
would do something more complicated for than what pg_ascii_tolower() does.
Did you measure how big is the impact from each individual patch?
Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
they make any difference on their own.
- Heikki
On Tue, 6 Jan 2026 at 22:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 05/01/2026 17:01, David Geier wrote:
> > The script I used for testing is attached. I ran CREATE INDEX three
> > times and took the fastest run. I'm getting the following results on my
> > i9-13950HX dev laptop in release build:
> >
> > Data set | Patched (ms) | Master (ms) | Speedup
> > --------------------|--------------|--------------|----------
> > movies(plot) | 3,409 | 10,311 | 3.02x
> > lineitem(l_comment) | 161,569 | 256,986 | 1.59x
> >
>
> Impressive speedup!
>
> > The attached patches do the following:
> >
> > - v1-0001-Inline-ginCompareAttEntries.patch: Inline
> > ginCompareAttEntries() which is very frequently called by the GIN code.
>
> Looks good.
>
> > - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
> > instead of FunctionCall2Coll(). This saves a bunch of per-comparison
> > setup code, such as calling InitFunctionCallInfoData().
>
> You lose the check for NULL result with this. That's probably still
> worth checking.
>
> > - v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
> > qsort(), to inline calls to the sort comparator. This is an interim step
> > that is further improved on by patch 0006.
>
> ok
>
> > - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
> > ginExtractEntries() deduplicates and sorts the entries returned from the
> > extract value function. In case of pg_trgm, that is completely redundant
> > because the trigrams are already deduplicated and sorted. The current
> > version of this patch is just to demonstrate the potential. We need to
> > think about what we want here. Ideally, we would require the extraction
> > function to provide the entries deduplicated and sorted. Alternatively,
> > we could indicate to ginExtractEntries() if the entries are already
> > deduplicated and sorted. If we don't want to alter the signature of the
> > extract value function, we could e.g. use the MSB of the nentries argument.
>
> Yeah, this seems wrong as it is. You're assuming that if the extract
> function returns nullFlags == NULL, the array is already sorted and deduped.
>
> > - v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
> > btint4cmp(), which is heavily called from the GIN code. This might as
> > well have benefit in other parts of the code base.
>
> Seems reasonable.
>
> > v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
> > with radix sort. For the purpose of demonstrating the possible gains,
> > I've only replaced the signed variant for now. I've also tried using
> > simplehash.h for deduplicating followed by a sort_template.h-based sort.
> > But that was slower.
>
> Ok.
>
> > v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
> > full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
> > equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
> > optimization can be done in plenty of other places in our code base.
> > Likely, in most of them the gains are insignificant.
>
> Makes sense. I'm a little disappointed the compiler won't do that
> optimization for us..
>
> Perhaps we should introduce a new qunique_eq() function with a different
> callback signature:
>
> /* like qunique(), but the callback function returns true/false rather
> than int */
> static inline size_t
> qunique_eq(void *array, size_t elements, size_t width,
> bool (*equal) (const void *, const void *))
>
> > v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
> > of text is actually ASCII. Hence, we provide a fast path for this case
> > which is exercised if the MSB of the current character is unset.
>
> This uses pg_ascii_tolower() when for ASCII characters when built with
> the IGNORECASE. I don't think that's correct, if the proper collation
> would do something more complicated for than what pg_ascii_tolower() does.
>
> Did you measure how big is the impact from each individual patch?
> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
> they make any difference on their own.
>
> - Heikki
>
>
>
Hi!
patches 0003, 0004 & 0008 applies with whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am v1-0003-Use-sort_template.h.patch
Applying: Use sort_template.h
.git/rebase-apply/patch:66: trailing whitespace.
warning: 1 line adds whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am
v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
Applying: Avoid dedup and sort in ginExtractEntries
.git/rebase-apply/patch:30: trailing whitespace.
{
warning: 1 line adds whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
Applying: Add ASCII fastpath to generate_trgm_only()
.git/rebase-apply/patch:101: trailing whitespace.
else
warning: 1 line adds whitespace errors.
I did run benchmarks of my VM using your data. v1-0001 solely improves
runtime by 4-5%. v2-0002 does not show runtime improvement for me.
With parallel GIN build, performance gains are 2-3 % for 2 workers and
not noticeable after that.
I will try to run some more benchmarks on this.
--
Best regards,
Kirill Reshke
Hi Heikki!
Thanks for looking at the patch set.
On 06.01.2026 18:00, Heikki Linnakangas wrote:
> On 05/01/2026 17:01, David Geier wrote:
>> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
>> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
>> setup code, such as calling InitFunctionCallInfoData().
>
> You lose the check for NULL result with this. That's probably still
> worth checking.
It seems like existing code where all args are not null, has that safety
check. Added it for consistency.
>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
>> ginExtractEntries() deduplicates and sorts the entries returned from the
>> extract value function. In case of pg_trgm, that is completely redundant
>> because the trigrams are already deduplicated and sorted. The current
>> version of this patch is just to demonstrate the potential. We need to
>> think about what we want here. Ideally, we would require the extraction
>> function to provide the entries deduplicated and sorted. Alternatively,
>> we could indicate to ginExtractEntries() if the entries are already
>> deduplicated and sorted. If we don't want to alter the signature of the
>> extract value function, we could e.g. use the MSB of the nentries
>> argument.
>
> Yeah, this seems wrong as it is. You're assuming that if the extract
> function returns nullFlags == NULL, the array is already sorted and
> deduped.
As said, that was just for demonstration purposes of the possible gains.
I've changed the code now such that the extractValue function of the GIN
index can indicate via the third argument uniqueAndSorted, if the
returned keys are already unique and sorted.
Unfortunately, it seems like this patch regresses performance. See
measurements below. I haven't had the time to investigate why that is.
It's pretty counter intuitive, given that this patch effectively only
removes code. Maybe you could re-test patch 0004 and share your runtimes?
>> v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
>> full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
>> equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
>> optimization can be done in plenty of other places in our code base.
>> Likely, in most of them the gains are insignificant.
>
> Makes sense. I'm a little disappointed the compiler won't do that
> optimization for us..
I thought the same.
>
> Perhaps we should introduce a new qunique_eq() function with a different
> callback signature:
>
> /* like qunique(), but the callback function returns true/false rather
> than int */
> static inline size_t
> qunique_eq(void *array, size_t elements, size_t width,
> bool (*equal) (const void *, const void *))
>
I would prefer to change qunique() instead. That would enforce using an
adequate comparison function from the get go. There are only ~15 calls
to qunique(). So refactoring this should also be a fairly small patch. I
can do that if there's agreement for that approach.
>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>> of text is actually ASCII. Hence, we provide a fast path for this case
>> which is exercised if the MSB of the current character is unset.
>
> This uses pg_ascii_tolower() when for ASCII characters when built with
> the IGNORECASE. I don't think that's correct, if the proper collation
> would do something more complicated for than what pg_ascii_tolower() does.
Oh, that's evil. I had tested that specifically. But it only worked
because the code in master uses str_tolower() with
DEFAULT_COLLATION_OID. So using a different locale like in the following
example does something different than when creating a database with the
same locale.
postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı
postgres=# select show_trgm('III' COLLATE "tr_TR");
show_trgm
-------------------------
{" i"," ii","ii ",iii}
(1 row)
But when using tr_TR as default locale of the database the following
happens:
postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı
postgres=# select show_trgm('III');sü
show_trgm
---------------------------------------
{0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
I'm wondering if that's intentional to begin with. Shouldn't the code
instead pass PG_GET_COLLATION() to str_tolower()? Might require some
research to see how other index types handle locales.
Coming back to the original problem: the lengthy comment at the top of
pg_locale_libc.c, suggests that in some cases ASCII characters are
handled the pg_ascii_tolower() way for the default locale. See for
example tolower_libc_mb(). So a character by character conversion using
that function will yield a different result than strlower_libc_mb(). I'm
wondering why that is.
Anyways, we could limit the optimization to only kick in when the used
locale follows the same rules as pg_ascii_tolower(). We could test that
when creating the locale and store that info in pg_locale_struct.
Thoughts?
> Did you measure how big is the impact from each individual patch?
> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
> they make any difference on their own.
Here is the impact of each patch. I ran again CREATE INDEX three times
and took the fastest run. The run of each patch includes all previous
patches as well. For example, the timings for patch 0003 were measured
with a binary that also had patch 0002 and 0001 applied. To get the
impact of each patch in isolation, the delta to the previous run was taken.
Code | movies |delta | lineitem | delta
------------------------------------|--------|-------|------------------
master | 10,311 | 0 | 256,986 | 0
v1-0001-Inline-ginCompareAttEntries | 9,694 | 617 | 239,778 | 17,208
v1-0002-Optimized-comparison-func | 9,510 | 184 | 238,094 | 1,684
v1-0003-Use-sort_template.h | 8,661 | 849 | 231,190 | 6,904
v1-0004-Avoid-dedup-and-sort-in | 9,305 | -644 | 232,472 | -1,282
v1-0005-Make-btint4cmp-branchless | 8,240 | 1,065 | 228,387 | 4,085
v1-0006-Use-radix-sort | 6,976 | 1,264 | 207,687 | 20,700
v1-0007-Faster-qunique-comparator | 5,911 | 1,065 | 203,744 | 3,943
v1-0008-Add-ASCII-fastpath | 3,409 | 2,502 | 161,469 | 42,275
Attached is v2 of the patch set with the aforementioned changes. I've
also fixed the white space errors in 0003, 0004 and 0008, as reported by
Kirill.
--
David Geier
Вложения
- v2-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
- v2-0007-Faster-qunique-comparator.patch
- v2-0006-Use-radix-sort.patch
- v2-0005-Make-btint4cmp-branchless.patch
- v2-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
- v2-0003-Use-sort_template.h.patch
- v2-0002-Optimized-comparison-functions.patch
- v2-0001-Inline-ginCompareAttEntries.patch
Hi! > Hi! > patches 0003, 0004 & 0008 applies with whitespace errors. > I've fixed the white space errors. See v2 of the patch set in [1]. > I did run benchmarks of my VM using your data. v1-0001 solely improves > runtime by 4-5%. v2-0002 does not show runtime improvement for me. > With parallel GIN build, performance gains are 2-3 % for 2 workers and > not noticeable after that. > > I will try to run some more benchmarks on this. Thanks. I've also included the delta for each patch in [1]. I would be curious what you measure, especially for patch 0004, where I curiously measured a regression. [1] https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com -- David Geier