Discussion: Reduce build times of pg_trgm GIN indexes
Hi hackers,

Attached is a series of patches which gradually reduce the time it takes to create GIN indexes. Most of the gains come from optimizing the trigram extraction code in pg_trgm. A few small optimizations apply to any GIN index operator class.

The changes are motivated by the time it takes to create GIN indexes on large production tables, especially on columns with long strings. Even with multiple parallel maintenance workers I've seen this take hours.

For testing purposes I've used two different data sets:

1. The l_comment column of the TPC-H SF 10 lineitem table. l_comment contains relatively short strings with a minimum of 10, a maximum of 43 and an average of ~27 characters.

2. The plots from a collection of movies from Wikipedia. The plots are much longer than l_comment, with a minimum of 15, a maximum of 36,773 and an average of ~2,165 characters. The CSV file can be downloaded here [1].

Testing both cases is important because a big part of the trigram extraction time is spent on removing duplicates. The longer the string, the more duplicates are usually encountered.

The script I used for testing is attached. I ran CREATE INDEX three times and took the fastest run. I'm getting the following results on my i9-13950HX dev laptop in a release build:

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|-------------|--------
movies(plot)        |        3,409 |      10,311 |   3.02x
lineitem(l_comment) |      161,569 |     256,986 |   1.59x

The attached patches do the following:

- v1-0001-Inline-ginCompareAttEntries.patch: Inline ginCompareAttEntries(), which is very frequently called by the GIN code.

- v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke() instead of FunctionCall2Coll(). This saves a bunch of per-comparison setup code, such as calling InitFunctionCallInfoData().

- v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of qsort(), to inline calls to the sort comparator. This is an interim step that is further improved on by patch 0006.

- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch: ginExtractEntries() deduplicates and sorts the entries returned from the extract value function. In the case of pg_trgm, that is completely redundant because the trigrams are already deduplicated and sorted. The current version of this patch just demonstrates the potential. We need to think about what we want here. Ideally, we would require the extraction function to provide the entries deduplicated and sorted. Alternatively, we could indicate to ginExtractEntries() whether the entries are already deduplicated and sorted. If we don't want to alter the signature of the extract value function, we could e.g. use the MSB of the nentries argument.

- v1-0005-Make-btint4cmp-branchless.patch: Removes branches from btint4cmp(), which is heavily called from the GIN code. This might also benefit other parts of the code base.

- v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort() with a radix sort. For the purpose of demonstrating the possible gains, I've only replaced the signed variant for now. I've also tried using simplehash.h for deduplication followed by a sort_template.h-based sort, but that was slower.

- v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a full sort comparator (-1 = less, 0 = equal, 1 = greater) but only an equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same optimization can be done in plenty of other places in our code base, though in most of them the gains are likely insignificant.

- v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically, lots of text is actually ASCII. Hence, we provide a fast path for this case, which is exercised if the MSB of the current character is unset.

With the above changes, the majority of the runtime is now spent on inserting the trigrams into the GIN index via ginInsertBAEntry(). The code in master uses a red-black tree for further deduplication and sorting. Traversing the red-black tree and updating it is pretty slow. I haven't looked through all the code yet, but it seems to me that we would be better off replacing the red-black tree with a sort and/or a hash map. But I'll leave this as future work for now.

[1] https://github.com/kiq005/movie-recommendation/raw/refs/heads/master/src/dataset/wiki_movie_plots_deduped.csv

--
David Geier
Attachments
- v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
- v1-0007-Faster-qunique-comparator.patch
- v1-0006-Use-radix-sort.patch
- v1-0005-Make-btint4cmp-branchless.patch
- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
- v1-0003-Use-sort_template.h.patch
- v1-0002-Optimized-comparison-functions.patch
- v1-0001-Inline-ginCompareAttEntries.patch
- test_gin_optimizations.sql
On 05/01/2026 17:01, David Geier wrote:
> The script I used for testing is attached. I ran CREATE INDEX three
> times and took the fastest run. I'm getting the following results on my
> i9-13950HX dev laptop in release build:
>
> Data set | Patched (ms) | Master (ms) | Speedup
> --------------------|--------------|--------------|----------
> movies(plot) | 3,409 | 10,311 | 3.02x
> lineitem(l_comment) | 161,569 | 256,986 | 1.59x
>
Impressive speedup!
> The attached patches do the following:
>
> - v1-0001-Inline-ginCompareAttEntries.patch: Inline
> ginCompareAttEntries() which is very frequently called by the GIN code.
Looks good.
> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
> setup code, such as calling InitFunctionCallInfoData().
You lose the check for NULL result with this. That's probably still
worth checking.
> - v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
> qsort(), to inline calls to the sort comparator. This is an interim step
> that is further improved on by patch 0006.
ok
> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
> ginExtractEntries() deduplicates and sorts the entries returned from the
> extract value function. In case of pg_trgm, that is completely redundant
> because the trigrams are already deduplicated and sorted. The current
> version of this patch is just to demonstrate the potential. We need to
> think about what we want here. Ideally, we would require the extraction
> function to provide the entries deduplicated and sorted. Alternatively,
> we could indicate to ginExtractEntries() if the entries are already
> deduplicated and sorted. If we don't want to alter the signature of the
> extract value function, we could e.g. use the MSB of the nentries argument.
Yeah, this seems wrong as it is. You're assuming that if the extract
function returns nullFlags == NULL, the array is already sorted and deduped.
> - v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
> btint4cmp(), which is heavily called from the GIN code. This might as
> well have benefit in other parts of the code base.
Seems reasonable.
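For reference, the branchless pattern usually employed for this kind of change looks like the following. This is an illustrative sketch, not the actual contents of the 0005 patch, and the function name is made up:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative branchless three-way comparison for int32 keys, in the
 * spirit of the 0005 patch (not the patch's actual code).
 * (a > b) - (a < b) yields -1/0/1 and typically compiles to a pair of
 * setcc instructions plus a subtraction, avoiding a branch that the
 * CPU cannot reliably predict on random key data.
 */
static inline int
int32_cmp_branchless(int32_t a, int32_t b)
{
	return (a > b) - (a < b);
}
```

Note that the naive `a - b` formulation would overflow for distant operands such as INT32_MIN and INT32_MAX; the flag-based form above is safe across the full range.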
> v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
> with radix sort. For the purpose of demonstrating the possible gains,
> I've only replaced the signed variant for now. I've also tried using
> simplehash.h for deduplicating followed by a sort_template.h-based sort.
> But that was slower.
Ok.
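For readers following along, the general shape of an LSD radix sort over signed 32-bit keys is sketched below. It relies on the usual trick of flipping the sign bit so that signed order coincides with unsigned order, then does four stable 8-bit counting passes. This is a self-contained illustration, not the code from patch 0006:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative LSD radix sort for signed 32-bit keys (not the 0006
 * patch's actual code).  XORing the sign bit maps signed order onto
 * unsigned order, so four stable byte-wise counting passes suffice.
 */
static void
radix_sort_int32(int32_t *keys, size_t n)
{
	uint32_t   *src;
	uint32_t   *dst;
	size_t		i;
	int			pass;

	if (n < 2)
		return;

	src = malloc(n * sizeof(uint32_t));
	dst = malloc(n * sizeof(uint32_t));

	for (i = 0; i < n; i++)
		src[i] = (uint32_t) keys[i] ^ 0x80000000u;	/* flip sign bit */

	for (pass = 0; pass < 4; pass++)
	{
		size_t		count[256];
		size_t		pos[256];
		uint32_t   *tmp;
		int			shift = pass * 8;

		memset(count, 0, sizeof(count));
		for (i = 0; i < n; i++)
			count[(src[i] >> shift) & 0xFF]++;

		/* prefix sums give the first output slot of each bucket */
		pos[0] = 0;
		for (i = 1; i < 256; i++)
			pos[i] = pos[i - 1] + count[i - 1];

		for (i = 0; i < n; i++)
			dst[pos[(src[i] >> shift) & 0xFF]++] = src[i];

		tmp = src;
		src = dst;
		dst = tmp;
	}

	/* after an even number of passes the result is back in src */
	for (i = 0; i < n; i++)
		keys[i] = (int32_t) (src[i] ^ 0x80000000u);

	free(src);
	free(dst);
}
```

Because each pass is stable and keyed on successively more significant bytes, the array ends up fully sorted; the two scratch buffers are the usual space/time trade-off versus a comparison sort.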
> v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
> full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
> equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
> optimization can be done in plenty of other places in our code base.
> Likely, in most of them the gains are insignificant.
Makes sense. I'm a little disappointed the compiler won't do that
optimization for us..
Perhaps we should introduce a new qunique_eq() function with a different
callback signature:
/* like qunique(), but the callback function returns true/false rather
than int */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
bool (*equal) (const void *, const void *))
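Fleshing out that signature, an implementation might look as follows. This mirrors the structure of the existing qunique() and is only an illustration of the proposal; int_eq is a made-up example callback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Like qunique(), but the callback returns true/false rather than int:
 * collapse adjacent duplicates in an already-sorted array and return
 * the new element count.  Sketch of the proposal above.
 */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
		   bool (*equal) (const void *, const void *))
{
	char	   *bytes = (char *) array;
	size_t		i,
				j;

	if (elements <= 1)
		return elements;

	for (i = 1, j = 0; i < elements; i++)
	{
		/* keep element i only if it differs from the last kept one */
		if (!equal(bytes + i * width, bytes + j * width) && ++j != i)
			memcpy(bytes + j * width, bytes + i * width, width);
	}
	return j + 1;
}

/* Example equality callback for int elements. */
static bool
int_eq(const void *a, const void *b)
{
	return *(const int *) a == *(const int *) b;
}
```

The point of the boolean callback is that the comparator can return as soon as inequality is established, instead of computing a full three-way ordering that qunique() then only tests against zero.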
> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
> of text is actually ASCII. Hence, we provide a fast path for this case
> which is exercised if the MSB of the current character is unset.
This uses pg_ascii_tolower() for ASCII characters when built with
IGNORECASE. I don't think that's correct if the proper collation
would do something more complicated than what pg_ascii_tolower() does.
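For readers unfamiliar with the trick: in UTF-8, any byte with its most significant bit clear is a complete one-byte (ASCII) character, which is what makes a byte-wise fast path possible at all. A sketch of the shape follows; the helper names are hypothetical, this is not the patch's code, and, per the point above, it is only valid when the locale is known to lowercase ASCII this way:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True when c is a complete one-byte (ASCII) character in UTF-8. */
static inline int
is_ascii_byte(unsigned char c)
{
	return (c & 0x80) == 0;
}

/* ASCII-only lowercasing, analogous to what pg_ascii_tolower() does. */
static inline unsigned char
ascii_tolower_byte(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/*
 * Lowercase the leading ASCII run of src into dst.  Returns the number
 * of bytes handled; the caller falls back to the locale-aware slow path
 * for the remainder.  Hypothetical sketch, not the patch's code.
 */
static size_t
lower_ascii_prefix(const char *src, size_t len, char *dst)
{
	size_t		i;

	for (i = 0; i < len; i++)
	{
		if (!is_ascii_byte((unsigned char) src[i]))
			break;
		dst[i] = (char) ascii_tolower_byte((unsigned char) src[i]);
	}
	return i;
}
```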
Did you measure how big the impact from each individual patch is?
Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
they make any difference on their own.
- Heikki
On Tue, 6 Jan 2026 at 22:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 05/01/2026 17:01, David Geier wrote:
> > The script I used for testing is attached. I ran CREATE INDEX three
> > times and took the fastest run. I'm getting the following results on my
> > i9-13950HX dev laptop in release build:
> >
> > Data set | Patched (ms) | Master (ms) | Speedup
> > --------------------|--------------|--------------|----------
> > movies(plot) | 3,409 | 10,311 | 3.02x
> > lineitem(l_comment) | 161,569 | 256,986 | 1.59x
> >
>
> Impressive speedup!
>
> > The attached patches do the following:
> >
> > - v1-0001-Inline-ginCompareAttEntries.patch: Inline
> > ginCompareAttEntries() which is very frequently called by the GIN code.
>
> Looks good.
>
> > - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
> > instead of FunctionCall2Coll(). This saves a bunch of per-comparison
> > setup code, such as calling InitFunctionCallInfoData().
>
> You lose the check for NULL result with this. That's probably still
> worth checking.
>
> > - v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
> > qsort(), to inline calls to the sort comparator. This is an interim step
> > that is further improved on by patch 0006.
>
> ok
>
> > - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
> > ginExtractEntries() deduplicates and sorts the entries returned from the
> > extract value function. In case of pg_trgm, that is completely redundant
> > because the trigrams are already deduplicated and sorted. The current
> > version of this patch is just to demonstrate the potential. We need to
> > think about what we want here. Ideally, we would require the extraction
> > function to provide the entries deduplicated and sorted. Alternatively,
> > we could indicate to ginExtractEntries() if the entries are already
> > deduplicated and sorted. If we don't want to alter the signature of the
> > extract value function, we could e.g. use the MSB of the nentries argument.
>
> Yeah, this seems wrong as it is. You're assuming that if the extract
> function returns nullFlags == NULL, the array is already sorted and deduped.
>
> > - v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
> > btint4cmp(), which is heavily called from the GIN code. This might as
> > well have benefit in other parts of the code base.
>
> Seems reasonable.
>
> > v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
> > with radix sort. For the purpose of demonstrating the possible gains,
> > I've only replaced the signed variant for now. I've also tried using
> > simplehash.h for deduplicating followed by a sort_template.h-based sort.
> > But that was slower.
>
> Ok.
>
> > v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
> > full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
> > equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
> > optimization can be done in plenty of other places in our code base.
> > Likely, in most of them the gains are insignificant.
>
> Makes sense. I'm a little disappointed the compiler won't do that
> optimization for us..
>
> Perhaps we should introduce a new qunique_eq() function with a different
> callback signature:
>
> /* like qunique(), but the callback function returns true/false rather
> than int */
> static inline size_t
> qunique_eq(void *array, size_t elements, size_t width,
> bool (*equal) (const void *, const void *))
>
> > v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
> > of text is actually ASCII. Hence, we provide a fast path for this case
> > which is exercised if the MSB of the current character is unset.
>
> This uses pg_ascii_tolower() for ASCII characters when built with
> IGNORECASE. I don't think that's correct if the proper collation
> would do something more complicated than what pg_ascii_tolower() does.
>
> Did you measure how big is the impact from each individual patch?
> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
> they make any difference on their own.
>
> - Heikki
>
>
>
Hi!
patches 0003, 0004 & 0008 apply with whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am v1-0003-Use-sort_template.h.patch
Applying: Use sort_template.h
.git/rebase-apply/patch:66: trailing whitespace.
warning: 1 line adds whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am
v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
Applying: Avoid dedup and sort in ginExtractEntries
.git/rebase-apply/patch:30: trailing whitespace.
{
warning: 1 line adds whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
Applying: Add ASCII fastpath to generate_trgm_only()
.git/rebase-apply/patch:101: trailing whitespace.
else
warning: 1 line adds whitespace errors.
I ran benchmarks on my VM using your data. v1-0001 alone improves
runtime by 4-5%. v1-0002 does not show a runtime improvement for me.
With parallel GIN build, the performance gains are 2-3% for 2 workers
and not noticeable after that.
I will try to run some more benchmarks on this.
--
Best regards,
Kirill Reshke
Hi Heikki!
Thanks for looking at the patch set.
On 06.01.2026 18:00, Heikki Linnakangas wrote:
> On 05/01/2026 17:01, David Geier wrote:
>> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
>> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
>> setup code, such as calling InitFunctionCallInfoData().
>
> You lose the check for NULL result with this. That's probably still
> worth checking.
The existing code path where all arguments are non-null has that safety
check. I've added it here for consistency.
>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
>> ginExtractEntries() deduplicates and sorts the entries returned from the
>> extract value function. In case of pg_trgm, that is completely redundant
>> because the trigrams are already deduplicated and sorted. The current
>> version of this patch is just to demonstrate the potential. We need to
>> think about what we want here. Ideally, we would require the extraction
>> function to provide the entries deduplicated and sorted. Alternatively,
>> we could indicate to ginExtractEntries() if the entries are already
>> deduplicated and sorted. If we don't want to alter the signature of the
>> extract value function, we could e.g. use the MSB of the nentries
>> argument.
>
> Yeah, this seems wrong as it is. You're assuming that if the extract
> function returns nullFlags == NULL, the array is already sorted and
> deduped.
As said, that was just for demonstration purposes of the possible gains.
I've changed the code now such that the extractValue function of the GIN
index can indicate via the third argument uniqueAndSorted, if the
returned keys are already unique and sorted.
Unfortunately, it seems like this patch regresses performance. See
measurements below. I haven't had the time to investigate why that is.
It's pretty counter intuitive, given that this patch effectively only
removes code. Maybe you could re-test patch 0004 and share your runtimes?
>> v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
>> full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
>> equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
>> optimization can be done in plenty of other places in our code base.
>> Likely, in most of them the gains are insignificant.
>
> Makes sense. I'm a little disappointed the compiler won't do that
> optimization for us..
I thought the same.
>
> Perhaps we should introduce a new qunique_eq() function with a different
> callback signature:
>
> /* like qunique(), but the callback function returns true/false rather
> than int */
> static inline size_t
> qunique_eq(void *array, size_t elements, size_t width,
> bool (*equal) (const void *, const void *))
>
I would prefer to change qunique() instead. That would enforce using an
adequate comparison function from the get go. There are only ~15 calls
to qunique(). So refactoring this should also be a fairly small patch. I
can do that if there's agreement for that approach.
>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>> of text is actually ASCII. Hence, we provide a fast path for this case
>> which is exercised if the MSB of the current character is unset.
>
> This uses pg_ascii_tolower() when for ASCII characters when built with
> the IGNORECASE. I don't think that's correct, if the proper collation
> would do something more complicated for than what pg_ascii_tolower() does.
Oh, that's evil. I had tested that specifically. But it only worked
because the code in master uses str_tolower() with
DEFAULT_COLLATION_OID. So using a different locale like in the following
example does something different than when creating a database with the
same locale.
postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı
postgres=# select show_trgm('III' COLLATE "tr_TR");
show_trgm
-------------------------
{" i"," ii","ii ",iii}
(1 row)
But when using tr_TR as default locale of the database the following
happens:
postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı
postgres=# select show_trgm('III');
show_trgm
---------------------------------------
{0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
I'm wondering if that's intentional to begin with. Shouldn't the code
instead pass PG_GET_COLLATION() to str_tolower()? Might require some
research to see how other index types handle locales.
Coming back to the original problem: the lengthy comment at the top of
pg_locale_libc.c suggests that in some cases ASCII characters are
handled the pg_ascii_tolower() way for the default locale; see for
example tolower_libc_mb(). So a character-by-character conversion using
that function will yield a different result than strlower_libc_mb(). I'm
wondering why that is.
Anyways, we could limit the optimization to only kick in when the used
locale follows the same rules as pg_ascii_tolower(). We could test that
when creating the locale and store that info in pg_locale_struct.
Thoughts?
> Did you measure how big is the impact from each individual patch?
> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
> they make any difference on their own.
Here is the impact of each patch. I again ran CREATE INDEX three times
and took the fastest run. The run of each patch includes all previous
patches as well. For example, the timings for patch 0003 were measured
with a binary that also had patch 0002 and 0001 applied. To get the
impact of each patch in isolation, the delta to the previous run was taken.
Code                                | movies | delta | lineitem | delta
------------------------------------|--------|-------|----------|--------
master                              | 10,311 |     0 |  256,986 |      0
v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 |  239,778 | 17,208
v1-0002-Optimized-comparison-func   |  9,510 |   184 |  238,094 |  1,684
v1-0003-Use-sort_template.h         |  8,661 |   849 |  231,190 |  6,904
v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 |  232,472 | -1,282
v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 |  228,387 |  4,085
v1-0006-Use-radix-sort              |  6,976 | 1,264 |  207,687 | 20,700
v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 |  203,744 |  3,943
v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 |  161,469 | 42,275
Attached is v2 of the patch set with the aforementioned changes. I've
also fixed the white space errors in 0003, 0004 and 0008, as reported by
Kirill.
--
David Geier
Attachments
- v2-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
- v2-0007-Faster-qunique-comparator.patch
- v2-0006-Use-radix-sort.patch
- v2-0005-Make-btint4cmp-branchless.patch
- v2-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
- v2-0003-Use-sort_template.h.patch
- v2-0002-Optimized-comparison-functions.patch
- v2-0001-Inline-ginCompareAttEntries.patch
Hi!

> Hi!
> patches 0003, 0004 & 0008 applies with whitespace errors.

I've fixed the whitespace errors. See v2 of the patch set in [1].

> I did run benchmarks of my VM using your data. v1-0001 solely improves
> runtime by 4-5%. v2-0002 does not show runtime improvement for me.
> With parallel GIN build, performance gains are 2-3 % for 2 workers and
> not noticeable after that.
>
> I will try to run some more benchmarks on this.

Thanks. I've also included the delta for each patch in [1]. I would be
curious what you measure, especially for patch 0004, where I curiously
measured a regression.

[1] https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com

--
David Geier
On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
>>> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
>>> setup code, such as calling InitFunctionCallInfoData().
>>
>> You lose the check for NULL result with this. That's probably still
>> worth checking.
>
> It seems like existing code where all args are not null, has that safety
> check. Added it for consistency.
>
>>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
>>> ginExtractEntries() deduplicates and sorts the entries returned from the
>>> extract value function. In case of pg_trgm, that is completely redundant
>>> because the trigrams are already deduplicated and sorted. The current
>>> version of this patch is just to demonstrate the potential. We need to
>>> think about what we want here. Ideally, we would require the extraction
>>> function to provide the entries deduplicated and sorted. Alternatively,
>>> we could indicate to ginExtractEntries() if the entries are already
>>> deduplicated and sorted. If we don't want to alter the signature of the
>>> extract value function, we could e.g. use the MSB of the nentries
>>> argument.
>>
>> Yeah, this seems wrong as it is. You're assuming that if the extract
>> function returns nullFlags == NULL, the array is already sorted and
>> deduped.
>
> As said, that was just for demonstration purposes of the possible gains.
> I've changed the code now such that the extractValue function of the GIN
> index can indicate via the third argument uniqueAndSorted, if the
> returned keys are already unique and sorted.
>
> Unfortunately, it seems like this patch regresses performance. See
> measurements below. I haven't had the time to investigate why that is.
> It's pretty counter intuitive, given that this patch effectively only
> removes code. Maybe you could re-test patch 0004 and share your runtimes?

Looking at the 0001 and 0004 patches and ginExtractEntries(), the way
ginExtractEntries() handles NULLs looks a little inefficient. It treats
NULLs as proper entries, passing them through qsort() for deduplication
and comparing them with cmpEntries(). But the end result is always that
if the opclass's extractValueFn() function returned any NULLs, then
there's exactly one GIN_CAT_NULL_KEY as the last entry of the result
array. Surely we could be smarter about how we accomplish that. Your
0004 patch eliminates the deduplication overhead altogether, which is
great of course, but the point remains for when we still need the
deduplication.

Attached is an attempt at that. It avoids the null-checks in
cmpEntries(), saving a few cycles. That's drowned in noise with your
test cases, but with the attached test case with int arrays, I'm seeing
a 1-2% gain. That's not much, but I think it's still worth doing
because it makes the code a little simpler too, IMHO. (I didn't test it
together with the rest of your patches.)

>> Perhaps we should introduce a new qunique_eq() function with a different
>> callback signature:
>>
>> /* like qunique(), but the callback function returns true/false rather
>> than int */
>> static inline size_t
>> qunique_eq(void *array, size_t elements, size_t width,
>>            bool (*equal) (const void *, const void *))
>
> I would prefer to change qunique() instead. That would enforce using an
> adequate comparison function from the get go. There are only ~15 calls
> to qunique(). So refactoring this should also be a fairly small patch. I
> can do that if there's agreement for that approach.

Works for me.

At a quick glance, most if not all of the qunique() calls call qsort()
just before qunique(). I wonder if we should have a single "sort and
deduplicate" function instead. It could perhaps do some deduplication
"on the go", or other optimizations.

>> Did you measure how big is the impact from each individual patch?
>> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
>> they make any difference on their own.
>
> Here is the impact of each patch. I ran again CREATE INDEX three times
> and took the fastest run. The run of each patch includes all previous
> patches as well. For example, the timings for patch 0003 were measured
> with a binary that also had patch 0002 and 0001 applied. To get the
> impact of each patch in isolation, the delta to the previous run was
> taken.
>
> Code                                | movies | delta | lineitem | delta
> ------------------------------------|--------|-------|----------|--------
> master                              | 10,311 |     0 |  256,986 |      0
> v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 |  239,778 | 17,208
> v1-0002-Optimized-comparison-func   |  9,510 |   184 |  238,094 |  1,684
> v1-0003-Use-sort_template.h         |  8,661 |   849 |  231,190 |  6,904
> v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 |  232,472 | -1,282
> v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 |  228,387 |  4,085
> v1-0006-Use-radix-sort              |  6,976 | 1,264 |  207,687 | 20,700
> v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 |  203,744 |  3,943
> v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 |  161,469 | 42,275
>
> Attached is v2 of the patch set with the aforementioned changes. I've
> also fixed the white space errors in 0003, 0004 and 0008, as reported by
> Kirill.

Thanks, I pushed patch 0001 now, that's a simple and clear win.

- Heikki
Attachments
On 09.01.2026 19:36, Heikki Linnakangas wrote: >>>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch >>>> ginExtractEntries() deduplicates and sorts the entries returned from >>>> the >>>> extract value function. In case of pg_trgm, that is completely >>>> redundant >>>> because the trigrams are already deduplicated and sorted. The current >>>> version of this patch is just to demonstrate the potential. We need to >>>> think about what we want here. Ideally, we would require the extraction >>>> function to provide the entries deduplicated and sorted. Alternatively, >>>> we could indicate to ginExtractEntries() if the entries are already >>>> deduplicated and sorted. If we don't want to alter the signature of the >>>> extract value function, we could e.g. use the MSB of the nentries >>>> argument. >>> >>> Yeah, this seems wrong as it is. You're assuming that if the extract >>> function returns nullFlags == NULL, the array is already sorted and >>> deduped. >> >> As said, that was just for demonstration purposes of the possible gains. >> I've changed the code now such that the extractValue function of the GIN >> index can indicate via the third argument uniqueAndSorted, if the >> returned keys are already unique and sorted. >> >> Unfortunately, it seems like this patch regresses performance. See >> measurements below. I haven't had the time to investigate why that is. >> It's pretty counter intuitive, given that this patch effectively only >> removes code. Maybe you could re-test patch 0004 and share your runtimes? > > Looking at 0001 and 0004 patches and ginExtractEntries(), the way > ginExtractEntries() handles NULLs looks a little inefficient. It treats > NULLs as proper entries, passing them through qsort() for deduplication > and comparing them with cmpEntries(). But the end result is always that > if the opclass's extractValueFn() function returned any NULLs, then > there's exctly one GIN_CAT_NULL_KEY as the last entry of the result > array. 
Surely we could be smarter about how we accomplish that. Your > 0004 patch eliminates the deduplication overhead altogether, which is > great of course, but the point remains for when we still need the > deduplication. Good observation. I like this idea. I've focused on pg_trgm but making this code faster is certainly useful. Given that doing the sort on pre-sorted input apparently doesn't add measurable overhead, according to my benchmark results, we can apply your patch and leave mine out for the moment being. That's btw. also the reason for why 0002 doesn't show much gain: when the data is pre-sorted, cmpEntries() is not called as much. > > Attached is an attempt at that. It avoids the null-checks in > cmpEntries(), saving a few cycles. That's drowned in noise with your > test cases, but with the attached test case with int arrays, I'm seeing > a 1-2 % gain. That's not much, but I think it's still worth doing > because it makes the code a little simpler too, IMHO. (I didn't test it > together with the rest of your patches.) Performance is death by a thousand cuts and that's definitely the case for the GIN index code. I'm all for putting in these small improvements because they'll add up and what's now 2% can be 10% once other optimization are in. I took a look at your patch. Overall looks good to me. Just a few comments: 1) You should be able to create the categories array without the need for the subsequent for loop as follows: StaticAssertStmt(GIN_CAT_NORM_KEY == 0, "Assuming GIN_CAT_NORM_KEY=0"); *categories = palloc0_array(GinNullCategory, (nentries + (hasNull ? 1 : 0)); 2) Your test case seems sub-optimal to show the gains. The arrays don't contain any NULL values and are also already sorted. Or I'm missing something. 3) Could you try your patch in conjunction with 0002 on data that is not pre-sorted and then check if 0002 has more impact? That way cmpEntries() should be called much more often. 
4) Have you checked if there are regression tests that exercise this code? If not, how about adding some? >>> Perhaps we should introduce a new qunique_eq() function with a different >>> callback signature: >>> >>> /* like qunique(), but the callback function returns true/false rather >>> than int */ >>> static inline size_t >>> qunique_eq(void *array, size_t elements, size_t width, >>> bool (*equal) (const void *, const void *)) >>> >> >> I would prefer to change qunique() instead. That would enforce using an >> adequate comparison function from the get go. There are only ~15 calls >> to qunique(). So refactoring this should also be a fairly small patch. I >> can do that if there's agreement for that approach. > > Works for me. > > At quick glance, most if not all of the qunique() calls call qsort() > just before qunique(). I wonder if we should have a single "sort and > deduplicate" function, instead. It could perhaps do some deduplication > "on the go", or other optimizations. If it's just for deduplication purposes and the data doesn't have to end up sorted, something based on a hash map should be even faster. How about we start with changing the qunique() comparator signature and as a 2nd step take a closer look at how to go about providing a function that does it in one go? If you agree, I'll share a patch here next week. >>> Did you measure how big is the impact from each individual patch? >>> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if >>> they make any difference on their own. >> >> Here is the impact of each patch. I ran again CREATE INDEX three times >> and took the fastest run. The run of each patch includes all previous >> patches as well. For example, the timings for patch 0003 were measured >> with a binary that also had patch 0002 and 0001 applied. To get the >> impact of each patch in isolation, the delta to the previous run was >> taken. 
>>
>> Code                                | movies | delta | lineitem | delta
>> ------------------------------------|--------|-------|----------|--------
>> master                              | 10,311 |     0 |  256,986 |      0
>> v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 |  239,778 | 17,208
>> v1-0002-Optimized-comparison-func   |  9,510 |   184 |  238,094 |  1,684
>> v1-0003-Use-sort_template.h         |  8,661 |   849 |  231,190 |  6,904
>> v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 |  232,472 | -1,282
>> v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 |  228,387 |  4,085
>> v1-0006-Use-radix-sort              |  6,976 | 1,264 |  207,687 | 20,700
>> v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 |  203,744 |  3,943
>> v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 |  161,469 | 42,275
>>
>> Attached is v2 of the patch set with the aforementioned changes. I've
>> also fixed the whitespace errors in 0003, 0004 and 0008, as reported by
>> Kirill.
>
> Thanks, I pushed patch 0001 now, that's a simple and clear win.

Great. Thanks.

What about the other patches? 0003 and 0007 are also pretty simple and IMHO uncontroversial while giving decent savings. For 0006 I would make the code also work for char being unsigned. That's still missing.

Any thoughts about 0008 and my findings regarding the lower-case conversion for ASCII? Adding a flag to pg_locale_struct indicating whether it's safe to use ASCII-style tolower should be straightforward, and then the code should be correct.

--
David Geier
On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>>> of text is actually ASCII. Hence, we provide a fast path for this case
>>> which is exercised if the MSB of the current character is unset.
>>
>> This uses pg_ascii_tolower() for ASCII characters when built with
>> IGNORECASE. I don't think that's correct if the proper collation
>> would do something more complicated than what pg_ascii_tolower() does.
>
> Oh, that's evil. I had tested that specifically. But it only worked
> because the code in master uses str_tolower() with
> DEFAULT_COLLATION_OID. So using a different locale like in the following
> example does something different than when creating a database with the
> same locale.
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III' COLLATE "tr_TR");
> show_trgm
> -------------------------
> {" i"," ii","ii ",iii}
> (1 row)
>
> But when using tr_TR as default locale of the database the following
> happens:
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III');
> show_trgm
> ---------------------------------------
> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>
> I'm wondering if that's intentional to begin with. Shouldn't the code
> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
> research to see how other index types handle locales.
>
> Coming back to the original problem: the lengthy comment at the top of
> pg_locale_libc.c suggests that in some cases ASCII characters are
> handled the pg_ascii_tolower() way for the default locale. See for
> example tolower_libc_mb(). So a character by character conversion using
> that function will yield a different result than strlower_libc_mb(). I'm
> wondering why that is.
Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.
> Anyways, we could limit the optimization to only kick in when the used
> locale follows the same rules as pg_ascii_tolower(). We could test that
> when creating the locale and store that info in pg_locale_struct.
I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ascii character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.
Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for
string matching. It'd be a backwards-compatibility breaking change, though.
- Heikki
Hi Heikki!

On 09.01.2026 22:02, David Geier wrote:
> Given that doing the sort on pre-sorted input apparently doesn't add
> measurable overhead, according to my benchmark results, we can apply
> your patch and leave mine out for the time being.

I've removed my patch from the patch set in favor of yours. I've adapted your patch slightly as follows:

- I replaced the custom for-loop with qunique().
- I switched to sort_template.h, which gives a nice speedup as it can
  do some things more efficiently now that the entries are simple Datums.
- I use palloc0_array() to initialize the array to GIN_CAT_NORM_KEY in
  one go.

Your patch together with my changes gives a 20% speedup on a table with arrays of 1000 elements and 10% NULLs. See the attached test script.

> That's, by the way, also the reason why 0002 doesn't show much gain:
> when the data is pre-sorted, cmpEntries() is not called as much.

This turned out not to be the case. I tested 0002 with the attached script, but that didn't show any significant improvement either. That's still curious to me, because cmpEntries() is called hundreds of millions of times and the disassembly shows that the optimized function indeed calls the operator directly rather than going through the indirection of FunctionCall2Coll(). Anyway, I've removed the patch from the patch set for the time being.

>>> I would prefer to change qunique() instead. That would enforce using an
>>> adequate comparison function from the get-go. There are only ~15 calls
>>> to qunique(), so refactoring this should also be a fairly small patch. I
>>> can do that if there's agreement on that approach.
>>
>> Works for me.
>>
>> At a quick glance, most if not all of the qunique() calls call qsort()
>> just before qunique(). I wonder if we should have a single "sort and
>> deduplicate" function instead. It could perhaps do some deduplication
>> "on the go", or other optimizations.
>
> If it's just for deduplication purposes and the data doesn't have to
> end up sorted, something based on a hash map should be even faster.
> How about we start with changing the qunique() comparator signature
> and as a second step take a closer look at how to go about providing a
> function that does it in one go?

Thinking about this some more: ideally we'd have two functions, something like deduplicateArray() and sortAndDeduplicateArray(). We could initially implement deduplicateArray() on top of sortAndDeduplicateArray(). If we ever find a case that needs optimization and doesn't require the data to actually end up sorted, we can implement deduplicateArray() e.g. on top of simplehash.h. I'll draft a patch and submit it in a separate thread.

> What about the other patches? 0003 and 0007 are also pretty simple and
> IMHO uncontroversial while giving decent savings.

I've reordered the patches such that the ones I consider uncontroversial, small and ready to be committed come first (patches 0001 - 0004). The radix sort and ASCII fast-path patches come last (0005 and 0006). I would like to first concentrate on getting 0001 - 0004 in and then get back to 0005 and 0006.

I remeasured the savings of 0001 - 0004, which come on top of the already committed patch that inlined the comparison function (itself worth another ~5%):

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|-------------|--------
movies(plot)        |        8,058 |      10,311 | 1.27x
lineitem(l_comment) |      223,233 |     256,986 | 1.19x

--
David Geier
Attachments
- table_with_random_int_arrays.sql
- v3-0006-Add-ASCII-fastpath-to-generate_trgm_only.patch
- v3-0005-Optimize-generate_trgm-with-radix-sort.patch
- v3-0004-Faster-qunique-comparator-in-generate_trgm.patch
- v3-0003-Make-btint4cmp-branchless.patch
- v3-0002-Optimize-generate_trgm-with-sort_template.h.patch
- v3-0001-Optimize-sort-and-deduplication-in-ginExtractEntr.patch
>> Oh, that's evil. I had tested that specifically. But it only worked
>> because the code in master uses str_tolower() with
>> DEFAULT_COLLATION_OID. So using a different locale like in the following
>> example does something different than when creating a database with the
>> same locale.
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>> lower
>> -------
>> ııı
>>
>> postgres=# select show_trgm('III' COLLATE "tr_TR");
>> show_trgm
>> -------------------------
>> {" i"," ii","ii ",iii}
>> (1 row)
>>
>> But when using tr_TR as default locale of the database the following
>> happens:
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>> lower
>> -------
>> ııı
>>
>> postgres=# select show_trgm('III');
>> show_trgm
>> ---------------------------------------
>> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>>
>> I'm wondering if that's intentional to begin with. Shouldn't the code
>> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
>> research to see how other index types handle locales.
>>
>> Coming back to the original problem: the lengthy comment at the top of
>> pg_locale_libc.c suggests that in some cases ASCII characters are
>> handled the pg_ascii_tolower() way for the default locale. See for
>> example tolower_libc_mb(). So a character by character conversion using
>> that function will yield a different result than strlower_libc_mb(). I'm
>> wondering why that is.
>
> Hmm, yeah, that feels funny. The trigram code predates per-column
> collation support, so I guess we never really thought through how it
> should interact with COLLATE clauses.
I've written a patch to fix that. See [1].
>> Anyways, we could limit the optimization to only kick in when the used
>> locale follows the same rules as pg_ascii_tolower(). We could test that
>> when creating the locale and store that info in pg_locale_struct.
>
> I think that's only possible for libc locales, which operate one
> character at a time. In ICU locales, lower-casing a character can depend
> on the surrounding characters, so you cannot just test the conversion of
> every ascii character individually. It would make sense for libc locales
> though, and I hope the ICU functions are a little faster anyway.
>
> Although, we probably should be using case-folding rather than lower-
> casing with ICU locales anyway. Case-folding is designed for string
> matching. It'd be a backwards-compatibility breaking change, though.
Oh, I wasn't aware of that. Doing it only for libc locales still seems
useful.
Good point about case-folding. I'll look into that.
How do we usually go about such backwards-compatibility breaking
changes? Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?
[1]
https://www.postgresql.org/message-id/flat/db087c3e-230e-4119-8a03-8b5d74956bc2%40gmail.com
--
David Geier
On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
>
> How do we usually go about such backwards-compatibility breaking
> changes?

When it concerns a bug, we mention the change in the release notes with a warning to reindex affected indexes to be sure no known corruption remains. See e.g. the final entry in the PG18 release notes' migration section here: https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.

> Could we have pg_upgrade reindex all GIN indexes? Would that be
> acceptable?

No. We'd handle this like any other collation/opclass fix; we ask users to reindex their indexes in their own time after they've upgraded their cluster. Note that in this case it concerns an issue with just one GIN opclass, not all GIN indexes; so even if we were to address this in pg_upgrade, it wouldn't be correct to reindex every GIN index, as only a subset of those would be affected by this issue.

Generally speaking, pg_upgrade doesn't concern itself with the validity of the data structures described by the catalogs it upgrades; it only ensures that it correctly transcribes the catalogs from one version to another, and that the data files of the old cluster are transferred correctly without changes.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)
Hi Matthias,

On 21.01.2026 21:50, Matthias van de Meent wrote:
> On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
>>
>> How do we usually go about such backwards-compatibility breaking
>> changes?
>
> When it concerns a bug, we mention the change in the release notes
> with a warning to reindex affected indexes to be sure no known
> corruption remains. See e.g. the final entry in the PG18 release
> notes' migration section here:
> https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.
>
>> Could we have pg_upgrade reindex all GIN indexes? Would that be
>> acceptable?
>
> No. We'd handle this like any other collation/opclass fix; we ask
> users to reindex their indexes in their own time after they've
> upgraded their cluster. Note that in this case it concerns an issue
> with just one GIN opclass, not all GIN indexes; so even if we were to
> address this in pg_upgrade, it wouldn't be correct to reindex every
> GIN index, as only a subset of those would be affected by this issue.
>
> Generally speaking, pg_upgrade doesn't concern itself with the
> validity of the data structures described by the catalogs it upgrades;
> it only ensures that it correctly transcribes the catalogs from one
> version to another, and that the data files of the old cluster are
> transferred correctly without changes.

Thanks for the clarifications and the link to the release notes. That's very helpful. Then I know how to move on and will update the patch accordingly.

--
David Geier
On 23.01.2026 11:18, David Geier wrote:
> Hi Matthias,
>
> On 21.01.2026 21:50, Matthias van de Meent wrote:
>> On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
>>>
>>> How do we usually go about such backwards-compatibility breaking
>>> changes?
>>
>> When it concerns a bug, we mention the change in the release notes
>> with a warning to reindex affected indexes to be sure no known
>> corruption remains. See e.g. the final entry in the PG18 release
>> notes' migration section here:
>> https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.
>>
>>> Could we have pg_upgrade reindex all GIN indexes? Would that be
>>> acceptable?
>>
>> No. We'd handle this like any other collation/opclass fix; we ask
>> users to reindex their indexes in their own time after they've
>> upgraded their cluster. Note that in this case it concerns an issue
>> with just one GIN opclass, not all GIN indexes; so even if we were to
>> address this in pg_upgrade, it wouldn't be correct to reindex every
>> GIN index, as only a subset of those would be affected by this issue.
>>
>> Generally speaking, pg_upgrade doesn't concern itself with the
>> validity of the data structures described by the catalogs it upgrades;
>> it only ensures that it correctly transcribes the catalogs from one
>> version to another, and that the data files of the old cluster are
>> transferred correctly without changes.
>
> Thanks for the clarifications and the link to the release notes. That's
> very helpful. Then I know how to move on and will update the patch
> accordingly.

Attached are the patches rebased on latest master.

I've removed the ASCII fast-path patch 0006 as it turned out to be more complicated to make work than expected.

I kept the radix sort patch because it gives a decent speedup, but I would like to focus for now on getting patches 0001 - 0004 merged. They're all simple and, the way I see it, uncontroversial.
I remeasured the savings of 0001 - 0004, which come on top of the already committed patch that inlined the comparison function (itself worth another ~5%):

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|-------------|--------
movies(plot)        |        8,058 |      10,311 | 1.27x
lineitem(l_comment) |      223,233 |     256,986 | 1.19x

I've also registered the change at the commit fest, see https://commitfest.postgresql.org/patch/6418/.

--
David Geier
Attachments
> Attached are the patches rebased on latest master.
>
> I've removed the ASCII fast-path patch 0006 as it turned out to be more
> complicated to make work than expected.
>
> I kept the radix sort patch because it gives a decent speedup, but I
> would like to focus for now on getting patches 0001 - 0004 merged.
> They're all simple and, the way I see it, uncontroversial.
>
> I remeasured the savings of 0001 - 0004, which come on top of the
> already committed patch that inlined the comparison function (itself
> worth another ~5%):
>
> Data set            | Patched (ms) | Master (ms) | Speedup
> --------------------|--------------|-------------|--------
> movies(plot)        |        8,058 |      10,311 | 1.27x
> lineitem(l_comment) |      223,233 |     256,986 | 1.19x
>
> I've also registered the change at the commit fest, see
> https://commitfest.postgresql.org/patch/6418/.

Attached is v5, which removes an incorrect assertion from the radix sort code.

--
David Geier