Discussion: Reduce build times of pg_trgm GIN indexes
Hi hackers,

Attached is a series of patches which gradually reduce the time it takes to create GIN indexes. Most of the gains come from optimizing the trigram extraction code in pg_trgm. A few small optimizations apply to any GIN index operator class.

The changes are motivated by the time it takes to create GIN indexes on large production tables, especially on columns with long strings. Even with multiple parallel maintenance workers I've seen this take hours.

For testing purposes I've used two different data sets:

1. The l_comment column of the TPC-H SF 10 lineitem table. l_comment contains relatively short strings with a minimum of 10, a maximum of 43 and an average of ~27 characters.

2. The plots from a collection of movies from Wikipedia. The plots are much longer than l_comment, with a minimum of 15, a maximum of 36,773 and an average of ~2,165 characters. The CSV file can be downloaded here [1].

Testing both cases is important because a big part of the trigram extraction time is spent on removing duplicates. The longer the string, the more duplicates are usually encountered.

The script I used for testing is attached. I ran CREATE INDEX three times and took the fastest run. I'm getting the following results on my i9-13950HX dev laptop in a release build:

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|-------------|--------
movies(plot)        |        3,409 |      10,311 |   3.02x
lineitem(l_comment) |      161,569 |     256,986 |   1.59x

The attached patches do the following:

- v1-0001-Inline-ginCompareAttEntries.patch: Inline ginCompareAttEntries(), which is very frequently called by the GIN code.

- v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke() instead of FunctionCall2Coll(). This saves a bunch of per-comparison setup code, such as calling InitFunctionCallInfoData().

- v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of qsort(), to inline calls to the sort comparator. This is an interim step that is further improved on by patch 0006.

- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch: ginExtractEntries() deduplicates and sorts the entries returned from the extract value function. In the case of pg_trgm, that is completely redundant because the trigrams are already deduplicated and sorted. The current version of this patch just demonstrates the potential. We need to think about what we want here. Ideally, we would require the extraction function to provide the entries deduplicated and sorted. Alternatively, we could indicate to ginExtractEntries() whether the entries are already deduplicated and sorted. If we don't want to alter the signature of the extract value function, we could e.g. use the MSB of the nentries argument.

- v1-0005-Make-btint4cmp-branchless.patch: Removes branches from btint4cmp(), which is heavily called from the GIN code. This might also benefit other parts of the code base.

- v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort() with a radix sort. For the purpose of demonstrating the possible gains, I've only replaced the signed variant for now. I've also tried using simplehash.h for deduplication followed by a sort_template.h-based sort, but that was slower.

- v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a full sort comparator (-1 = less, 0 = equal, 1 = greater) but only an equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same optimization can be done in plenty of other places in our code base, though in most of them the gains are likely insignificant.

- v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically, lots of text is actually ASCII. Hence, we provide a fast path for this case, which is exercised if the MSB of the current character is unset.

With the above changes, the majority of the runtime is now spent on inserting the trigrams into the GIN index via ginInsertBAEntry(). The code in master uses a red-black tree for further deduplication and sorting. Traversing the red-black tree and updating it is pretty slow. I haven't looked through all the code yet, but it seems to me that we would be better off replacing the red-black tree with a sort and/or a hash map. But I'll leave this as future work for now.

[1] https://github.com/kiq005/movie-recommendation/raw/refs/heads/master/src/dataset/wiki_movie_plots_deduped.csv

--
David Geier
Attachments
- v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
- v1-0007-Faster-qunique-comparator.patch
- v1-0006-Use-radix-sort.patch
- v1-0005-Make-btint4cmp-branchless.patch
- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
- v1-0003-Use-sort_template.h.patch
- v1-0002-Optimized-comparison-functions.patch
- v1-0001-Inline-ginCompareAttEntries.patch
- test_gin_optimizations.sql
On 05/01/2026 17:01, David Geier wrote:
> The script I used for testing is attached. I ran CREATE INDEX three
> times and took the fastest run. I'm getting the following results on my
> i9-13950HX dev laptop in release build:
>
> Data set | Patched (ms) | Master (ms) | Speedup
> --------------------|--------------|--------------|----------
> movies(plot) | 3,409 | 10,311 | 3.02x
> lineitem(l_comment) | 161,569 | 256,986 | 1.59x
>
Impressive speedup!
> The attached patches do the following:
>
> - v1-0001-Inline-ginCompareAttEntries.patch: Inline
> ginCompareAttEntries() which is very frequently called by the GIN code.
Looks good.
> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
> setup code, such as calling InitFunctionCallInfoData().
You lose the check for NULL result with this. That's probably still
worth checking.
> - v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
> qsort(), to inline calls to the sort comparator. This is an interim step
> that is further improved on by patch 0006.
ok
> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
> ginExtractEntries() deduplicates and sorts the entries returned from the
> extract value function. In case of pg_trgm, that is completely redundant
> because the trigrams are already deduplicated and sorted. The current
> version of this patch is just to demonstrate the potential. We need to
> think about what we want here. Ideally, we would require the extraction
> function to provide the entries deduplicated and sorted. Alternatively,
> we could indicate to ginExtractEntries() if the entries are already
> deduplicated and sorted. If we don't want to alter the signature of the
> extract value function, we could e.g. use the MSB of the nentries argument.
Yeah, this seems wrong as it is. You're assuming that if the extract
function returns nullFlags == NULL, the array is already sorted and deduped.
> - v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
> btint4cmp(), which is heavily called from the GIN code. This might as
> well have benefit in other parts of the code base.
Seems reasonable.
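For reference, the branchless pattern usually employed for this kind of change looks like the following. This is an illustrative sketch, not the actual contents of the 0005 patch, and the function name is made up:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative branchless three-way comparison for int32 keys, in the
 * spirit of the 0005 patch (not the patch's actual code).
 * (a > b) - (a < b) yields -1/0/1 and typically compiles to a pair of
 * setcc instructions plus a subtraction, avoiding a branch that the
 * CPU cannot reliably predict on random key data.
 */
static inline int
int32_cmp_branchless(int32_t a, int32_t b)
{
	return (a > b) - (a < b);
}
```

Note that the naive `a - b` formulation would overflow for distant operands such as INT32_MIN and INT32_MAX; the flag-based form above is safe across the full range.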
> v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
> with radix sort. For the purpose of demonstrating the possible gains,
> I've only replaced the signed variant for now. I've also tried using
> simplehash.h for deduplicating followed by a sort_template.h-based sort.
> But that was slower.
Ok.
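For readers following along, the general shape of an LSD radix sort over signed 32-bit keys is sketched below. It relies on the usual trick of flipping the sign bit so that signed order coincides with unsigned order, then does four stable 8-bit counting passes. This is a self-contained illustration, not the code from patch 0006:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative LSD radix sort for signed 32-bit keys (not the 0006
 * patch's actual code).  XORing the sign bit maps signed order onto
 * unsigned order, so four stable byte-wise counting passes suffice.
 */
static void
radix_sort_int32(int32_t *keys, size_t n)
{
	uint32_t   *src;
	uint32_t   *dst;
	size_t		i;
	int			pass;

	if (n < 2)
		return;

	src = malloc(n * sizeof(uint32_t));
	dst = malloc(n * sizeof(uint32_t));

	for (i = 0; i < n; i++)
		src[i] = (uint32_t) keys[i] ^ 0x80000000u;	/* flip sign bit */

	for (pass = 0; pass < 4; pass++)
	{
		size_t		count[256];
		size_t		pos[256];
		uint32_t   *tmp;
		int			shift = pass * 8;

		memset(count, 0, sizeof(count));
		for (i = 0; i < n; i++)
			count[(src[i] >> shift) & 0xFF]++;

		/* prefix sums give the first output slot of each bucket */
		pos[0] = 0;
		for (i = 1; i < 256; i++)
			pos[i] = pos[i - 1] + count[i - 1];

		for (i = 0; i < n; i++)
			dst[pos[(src[i] >> shift) & 0xFF]++] = src[i];

		tmp = src;
		src = dst;
		dst = tmp;
	}

	/* after an even number of passes the result is back in src */
	for (i = 0; i < n; i++)
		keys[i] = (int32_t) (src[i] ^ 0x80000000u);

	free(src);
	free(dst);
}
```

Because each pass is stable and keyed on successively more significant bytes, the array ends up fully sorted; the two scratch buffers are the usual space/time trade-off versus a comparison sort.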
> v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
> full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
> equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
> optimization can be done in plenty of other places in our code base.
> Likely, in most of them the gains are insignificant.
Makes sense. I'm a little disappointed the compiler won't do that
optimization for us..
Perhaps we should introduce a new qunique_eq() function with a different
callback signature:
/* like qunique(), but the callback function returns true/false rather
than int */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
bool (*equal) (const void *, const void *))
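Fleshing out that signature, an implementation might look as follows. This mirrors the structure of the existing qunique() and is only an illustration of the proposal; int_eq is a made-up example callback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Like qunique(), but the callback returns true/false rather than int:
 * collapse adjacent duplicates in an already-sorted array and return
 * the new element count.  Sketch of the proposal above.
 */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
		   bool (*equal) (const void *, const void *))
{
	char	   *bytes = (char *) array;
	size_t		i,
				j;

	if (elements <= 1)
		return elements;

	for (i = 1, j = 0; i < elements; i++)
	{
		/* keep element i only if it differs from the last kept one */
		if (!equal(bytes + i * width, bytes + j * width) && ++j != i)
			memcpy(bytes + j * width, bytes + i * width, width);
	}
	return j + 1;
}

/* Example equality callback for int elements. */
static bool
int_eq(const void *a, const void *b)
{
	return *(const int *) a == *(const int *) b;
}
```

The point of the boolean callback is that the comparator can return as soon as inequality is established, instead of computing a full three-way ordering that qunique() then only tests against zero.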
> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
> of text is actually ASCII. Hence, we provide a fast path for this case
> which is exercised if the MSB of the current character is unset.
This uses pg_ascii_tolower() for ASCII characters when built with
IGNORECASE. I don't think that's correct if the proper collation
would do something more complicated than what pg_ascii_tolower() does.
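For readers unfamiliar with the trick: in UTF-8, any byte with its most significant bit clear is a complete one-byte (ASCII) character, which is what makes a byte-wise fast path possible at all. A sketch of the shape follows; the helper names are hypothetical, this is not the patch's code, and, per the point above, it is only valid when the locale is known to lowercase ASCII this way:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True when c is a complete one-byte (ASCII) character in UTF-8. */
static inline int
is_ascii_byte(unsigned char c)
{
	return (c & 0x80) == 0;
}

/* ASCII-only lowercasing, analogous to what pg_ascii_tolower() does. */
static inline unsigned char
ascii_tolower_byte(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/*
 * Lowercase the leading ASCII run of src into dst.  Returns the number
 * of bytes handled; the caller falls back to the locale-aware slow path
 * for the remainder.  Hypothetical sketch, not the patch's code.
 */
static size_t
lower_ascii_prefix(const char *src, size_t len, char *dst)
{
	size_t		i;

	for (i = 0; i < len; i++)
	{
		if (!is_ascii_byte((unsigned char) src[i]))
			break;
		dst[i] = (char) ascii_tolower_byte((unsigned char) src[i]);
	}
	return i;
}
```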
Did you measure how big the impact from each individual patch is?
Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
they make any difference on their own.
- Heikki
On Tue, 6 Jan 2026 at 22:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 05/01/2026 17:01, David Geier wrote:
> > The script I used for testing is attached. I ran CREATE INDEX three
> > times and took the fastest run. I'm getting the following results on my
> > i9-13950HX dev laptop in release build:
> >
> > Data set | Patched (ms) | Master (ms) | Speedup
> > --------------------|--------------|--------------|----------
> > movies(plot) | 3,409 | 10,311 | 3.02x
> > lineitem(l_comment) | 161,569 | 256,986 | 1.59x
> >
>
> Impressive speedup!
>
> > The attached patches do the following:
> >
> > - v1-0001-Inline-ginCompareAttEntries.patch: Inline
> > ginCompareAttEntries() which is very frequently called by the GIN code.
>
> Looks good.
>
> > - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
> > instead of FunctionCall2Coll(). This saves a bunch of per-comparison
> > setup code, such as calling InitFunctionCallInfoData().
>
> You lose the check for NULL result with this. That's probably still
> worth checking.
>
> > - v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
> > qsort(), to inline calls to the sort comparator. This is an interim step
> > that is further improved on by patch 0006.
>
> ok
>
> > - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
> > ginExtractEntries() deduplicates and sorts the entries returned from the
> > extract value function. In case of pg_trgm, that is completely redundant
> > because the trigrams are already deduplicated and sorted. The current
> > version of this patch is just to demonstrate the potential. We need to
> > think about what we want here. Ideally, we would require the extraction
> > function to provide the entries deduplicated and sorted. Alternatively,
> > we could indicate to ginExtractEntries() if the entries are already
> > deduplicated and sorted. If we don't want to alter the signature of the
> > extract value function, we could e.g. use the MSB of the nentries argument.
>
> Yeah, this seems wrong as it is. You're assuming that if the extract
> function returns nullFlags == NULL, the array is already sorted and deduped.
>
> > - v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
> > btint4cmp(), which is heavily called from the GIN code. This might as
> > well have benefit in other parts of the code base.
>
> Seems reasonable.
>
> > v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
> > with radix sort. For the purpose of demonstrating the possible gains,
> > I've only replaced the signed variant for now. I've also tried using
> > simplehash.h for deduplicating followed by a sort_template.h-based sort.
> > But that was slower.
>
> Ok.
>
> > v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
> > full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
> > equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
> > optimization can be done in plenty of other places in our code base.
> > Likely, in most of them the gains are insignificant.
>
> Makes sense. I'm a little disappointed the compiler won't do that
> optimization for us..
>
> Perhaps we should introduce a new qunique_eq() function with a different
> callback signature:
>
> /* like qunique(), but the callback function returns true/false rather
> than int */
> static inline size_t
> qunique_eq(void *array, size_t elements, size_t width,
> bool (*equal) (const void *, const void *))
>
> > v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
> > of text is actually ASCII. Hence, we provide a fast path for this case
> > which is exercised if the MSB of the current character is unset.
>
> This uses pg_ascii_tolower() for ASCII characters when built with
> IGNORECASE. I don't think that's correct if the proper collation
> would do something more complicated than what pg_ascii_tolower() does.
>
> Did you measure how big is the impact from each individual patch?
> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
> they make any difference on their own.
>
> - Heikki
>
>
>
Hi!
patches 0003, 0004 & 0008 apply with whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am v1-0003-Use-sort_template.h.patch
Applying: Use sort_template.h
.git/rebase-apply/patch:66: trailing whitespace.
warning: 1 line adds whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am
v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
Applying: Avoid dedup and sort in ginExtractEntries
.git/rebase-apply/patch:30: trailing whitespace.
{
warning: 1 line adds whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
Applying: Add ASCII fastpath to generate_trgm_only()
.git/rebase-apply/patch:101: trailing whitespace.
else
warning: 1 line adds whitespace errors.
I ran benchmarks on my VM using your data. v1-0001 alone improves
runtime by 4-5%. v1-0002 does not show a runtime improvement for me.
With parallel GIN build, the performance gains are 2-3% for 2 workers
and not noticeable after that.
I will try to run some more benchmarks on this.
--
Best regards,
Kirill Reshke
Hi Heikki!
Thanks for looking at the patch set.
On 06.01.2026 18:00, Heikki Linnakangas wrote:
> On 05/01/2026 17:01, David Geier wrote:
>> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
>> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
>> setup code, such as calling InitFunctionCallInfoData().
>
> You lose the check for NULL result with this. That's probably still
> worth checking.
The existing code path where all arguments are non-null has that safety
check. I've added it here for consistency.
>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
>> ginExtractEntries() deduplicates and sorts the entries returned from the
>> extract value function. In case of pg_trgm, that is completely redundant
>> because the trigrams are already deduplicated and sorted. The current
>> version of this patch is just to demonstrate the potential. We need to
>> think about what we want here. Ideally, we would require the extraction
>> function to provide the entries deduplicated and sorted. Alternatively,
>> we could indicate to ginExtractEntries() if the entries are already
>> deduplicated and sorted. If we don't want to alter the signature of the
>> extract value function, we could e.g. use the MSB of the nentries
>> argument.
>
> Yeah, this seems wrong as it is. You're assuming that if the extract
> function returns nullFlags == NULL, the array is already sorted and
> deduped.
As said, that was just for demonstration purposes of the possible gains.
I've changed the code now such that the extractValue function of the GIN
index can indicate via the third argument uniqueAndSorted, if the
returned keys are already unique and sorted.
Unfortunately, it seems like this patch regresses performance. See
measurements below. I haven't had the time to investigate why that is.
It's pretty counter intuitive, given that this patch effectively only
removes code. Maybe you could re-test patch 0004 and share your runtimes?
>> v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
>> full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
>> equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
>> optimization can be done in plenty of other places in our code base.
>> Likely, in most of them the gains are insignificant.
>
> Makes sense. I'm a little disappointed the compiler won't do that
> optimization for us..
I thought the same.
>
> Perhaps we should introduce a new qunique_eq() function with a different
> callback signature:
>
> /* like qunique(), but the callback function returns true/false rather
> than int */
> static inline size_t
> qunique_eq(void *array, size_t elements, size_t width,
> bool (*equal) (const void *, const void *))
>
I would prefer to change qunique() instead. That would enforce using an
adequate comparison function from the get go. There are only ~15 calls
to qunique(). So refactoring this should also be a fairly small patch. I
can do that if there's agreement for that approach.
>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>> of text is actually ASCII. Hence, we provide a fast path for this case
>> which is exercised if the MSB of the current character is unset.
>
> This uses pg_ascii_tolower() when for ASCII characters when built with
> the IGNORECASE. I don't think that's correct, if the proper collation
> would do something more complicated for than what pg_ascii_tolower() does.
Oh, that's evil. I had tested that specifically. But it only worked
because the code in master uses str_tolower() with
DEFAULT_COLLATION_OID. So using a different locale like in the following
example does something different than when creating a database with the
same locale.
postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı
postgres=# select show_trgm('III' COLLATE "tr_TR");
show_trgm
-------------------------
{" i"," ii","ii ",iii}
(1 row)
But when using tr_TR as default locale of the database the following
happens:
postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı
postgres=# select show_trgm('III');
show_trgm
---------------------------------------
{0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
I'm wondering if that's intentional to begin with. Shouldn't the code
instead pass PG_GET_COLLATION() to str_tolower()? Might require some
research to see how other index types handle locales.
Coming back to the original problem: the lengthy comment at the top of
pg_locale_libc.c suggests that in some cases ASCII characters are
handled the pg_ascii_tolower() way for the default locale; see for
example tolower_libc_mb(). So a character-by-character conversion using
that function will yield a different result than strlower_libc_mb(). I'm
wondering why that is.
Anyways, we could limit the optimization to only kick in when the used
locale follows the same rules as pg_ascii_tolower(). We could test that
when creating the locale and store that info in pg_locale_struct.
Thoughts?
> Did you measure how big is the impact from each individual patch?
> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
> they make any difference on their own.
Here is the impact of each patch. I again ran CREATE INDEX three times
and took the fastest run. The run of each patch includes all previous
patches as well. For example, the timings for patch 0003 were measured
with a binary that also had patch 0002 and 0001 applied. To get the
impact of each patch in isolation, the delta to the previous run was taken.
Code                                | movies | delta | lineitem | delta
------------------------------------|--------|-------|----------|--------
master                              | 10,311 |     0 |  256,986 |      0
v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 |  239,778 | 17,208
v1-0002-Optimized-comparison-func   |  9,510 |   184 |  238,094 |  1,684
v1-0003-Use-sort_template.h         |  8,661 |   849 |  231,190 |  6,904
v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 |  232,472 | -1,282
v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 |  228,387 |  4,085
v1-0006-Use-radix-sort              |  6,976 | 1,264 |  207,687 | 20,700
v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 |  203,744 |  3,943
v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 |  161,469 | 42,275
Attached is v2 of the patch set with the aforementioned changes. I've
also fixed the white space errors in 0003, 0004 and 0008, as reported by
Kirill.
--
David Geier
Attachments
- v2-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
- v2-0007-Faster-qunique-comparator.patch
- v2-0006-Use-radix-sort.patch
- v2-0005-Make-btint4cmp-branchless.patch
- v2-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
- v2-0003-Use-sort_template.h.patch
- v2-0002-Optimized-comparison-functions.patch
- v2-0001-Inline-ginCompareAttEntries.patch
Hi!

> Hi!
> patches 0003, 0004 & 0008 applies with whitespace errors.

I've fixed the whitespace errors. See v2 of the patch set in [1].

> I did run benchmarks of my VM using your data. v1-0001 solely improves
> runtime by 4-5%. v2-0002 does not show runtime improvement for me.
> With parallel GIN build, performance gains are 2-3 % for 2 workers and
> not noticeable after that.
>
> I will try to run some more benchmarks on this.

Thanks. I've also included the delta for each patch in [1]. I would be
curious what you measure, especially for patch 0004, where I curiously
measured a regression.

[1] https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com

--
David Geier
On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> - v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
>>> instead of FunctionCall2Coll(). This saves a bunch of per-comparison
>>> setup code, such as calling InitFunctionCallInfoData().
>>
>> You lose the check for NULL result with this. That's probably still
>> worth checking.
>
> It seems like existing code where all args are not null, has that safety
> check. Added it for consistency.
>
>>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
>>> ginExtractEntries() deduplicates and sorts the entries returned from the
>>> extract value function. In case of pg_trgm, that is completely redundant
>>> because the trigrams are already deduplicated and sorted. The current
>>> version of this patch is just to demonstrate the potential. We need to
>>> think about what we want here. Ideally, we would require the extraction
>>> function to provide the entries deduplicated and sorted. Alternatively,
>>> we could indicate to ginExtractEntries() if the entries are already
>>> deduplicated and sorted. If we don't want to alter the signature of the
>>> extract value function, we could e.g. use the MSB of the nentries
>>> argument.
>>
>> Yeah, this seems wrong as it is. You're assuming that if the extract
>> function returns nullFlags == NULL, the array is already sorted and
>> deduped.
>
> As said, that was just for demonstration purposes of the possible gains.
> I've changed the code now such that the extractValue function of the GIN
> index can indicate via the third argument uniqueAndSorted, if the
> returned keys are already unique and sorted.
>
> Unfortunately, it seems like this patch regresses performance. See
> measurements below. I haven't had the time to investigate why that is.
> It's pretty counter intuitive, given that this patch effectively only
> removes code. Maybe you could re-test patch 0004 and share your runtimes?

Looking at the 0001 and 0004 patches and ginExtractEntries(), the way
ginExtractEntries() handles NULLs looks a little inefficient. It treats
NULLs as proper entries, passing them through qsort() for deduplication
and comparing them with cmpEntries(). But the end result is always that
if the opclass's extractValueFn() function returned any NULLs, then
there's exactly one GIN_CAT_NULL_KEY as the last entry of the result
array. Surely we could be smarter about how we accomplish that. Your
0004 patch eliminates the deduplication overhead altogether, which is
great of course, but the point remains for when we still need the
deduplication.

Attached is an attempt at that. It avoids the null-checks in
cmpEntries(), saving a few cycles. That's drowned in noise with your
test cases, but with the attached test case with int arrays, I'm seeing
a 1-2% gain. That's not much, but I think it's still worth doing
because it makes the code a little simpler too, IMHO. (I didn't test it
together with the rest of your patches.)

>> Perhaps we should introduce a new qunique_eq() function with a different
>> callback signature:
>>
>> /* like qunique(), but the callback function returns true/false rather
>> than int */
>> static inline size_t
>> qunique_eq(void *array, size_t elements, size_t width,
>>            bool (*equal) (const void *, const void *))
>
> I would prefer to change qunique() instead. That would enforce using an
> adequate comparison function from the get go. There are only ~15 calls
> to qunique(). So refactoring this should also be a fairly small patch. I
> can do that if there's agreement for that approach.

Works for me.

At a quick glance, most if not all of the qunique() calls call qsort()
just before qunique(). I wonder if we should have a single "sort and
deduplicate" function instead. It could perhaps do some deduplication
"on the go", or other optimizations.

>> Did you measure how big is the impact from each individual patch?
>> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
>> they make any difference on their own.
>
> Here is the impact of each patch. I ran again CREATE INDEX three times
> and took the fastest run. The run of each patch includes all previous
> patches as well. For example, the timings for patch 0003 were measured
> with a binary that also had patch 0002 and 0001 applied. To get the
> impact of each patch in isolation, the delta to the previous run was
> taken.
>
> Code                                | movies | delta | lineitem | delta
> ------------------------------------|--------|-------|----------|--------
> master                              | 10,311 |     0 |  256,986 |      0
> v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 |  239,778 | 17,208
> v1-0002-Optimized-comparison-func   |  9,510 |   184 |  238,094 |  1,684
> v1-0003-Use-sort_template.h         |  8,661 |   849 |  231,190 |  6,904
> v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 |  232,472 | -1,282
> v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 |  228,387 |  4,085
> v1-0006-Use-radix-sort              |  6,976 | 1,264 |  207,687 | 20,700
> v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 |  203,744 |  3,943
> v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 |  161,469 | 42,275
>
> Attached is v2 of the patch set with the aforementioned changes. I've
> also fixed the white space errors in 0003, 0004 and 0008, as reported by
> Kirill.

Thanks, I pushed patch 0001 now, that's a simple and clear win.

- Heikki
Attachments
On 09.01.2026 19:36, Heikki Linnakangas wrote: >>>> - v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch >>>> ginExtractEntries() deduplicates and sorts the entries returned from >>>> the >>>> extract value function. In case of pg_trgm, that is completely >>>> redundant >>>> because the trigrams are already deduplicated and sorted. The current >>>> version of this patch is just to demonstrate the potential. We need to >>>> think about what we want here. Ideally, we would require the extraction >>>> function to provide the entries deduplicated and sorted. Alternatively, >>>> we could indicate to ginExtractEntries() if the entries are already >>>> deduplicated and sorted. If we don't want to alter the signature of the >>>> extract value function, we could e.g. use the MSB of the nentries >>>> argument. >>> >>> Yeah, this seems wrong as it is. You're assuming that if the extract >>> function returns nullFlags == NULL, the array is already sorted and >>> deduped. >> >> As said, that was just for demonstration purposes of the possible gains. >> I've changed the code now such that the extractValue function of the GIN >> index can indicate via the third argument uniqueAndSorted, if the >> returned keys are already unique and sorted. >> >> Unfortunately, it seems like this patch regresses performance. See >> measurements below. I haven't had the time to investigate why that is. >> It's pretty counter intuitive, given that this patch effectively only >> removes code. Maybe you could re-test patch 0004 and share your runtimes? > > Looking at 0001 and 0004 patches and ginExtractEntries(), the way > ginExtractEntries() handles NULLs looks a little inefficient. It treats > NULLs as proper entries, passing them through qsort() for deduplication > and comparing them with cmpEntries(). But the end result is always that > if the opclass's extractValueFn() function returned any NULLs, then > there's exctly one GIN_CAT_NULL_KEY as the last entry of the result > array. 
Surely we could be smarter about how we accomplish that. Your > 0004 patch eliminates the deduplication overhead altogether, which is > great of course, but the point remains for when we still need the > deduplication. Good observation. I like this idea. I've focused on pg_trgm but making this code faster is certainly useful. Given that doing the sort on pre-sorted input apparently doesn't add measurable overhead, according to my benchmark results, we can apply your patch and leave mine out for the moment being. That's btw. also the reason for why 0002 doesn't show much gain: when the data is pre-sorted, cmpEntries() is not called as much. > > Attached is an attempt at that. It avoids the null-checks in > cmpEntries(), saving a few cycles. That's drowned in noise with your > test cases, but with the attached test case with int arrays, I'm seeing > a 1-2 % gain. That's not much, but I think it's still worth doing > because it makes the code a little simpler too, IMHO. (I didn't test it > together with the rest of your patches.) Performance is death by a thousand cuts and that's definitely the case for the GIN index code. I'm all for putting in these small improvements because they'll add up and what's now 2% can be 10% once other optimization are in. I took a look at your patch. Overall looks good to me. Just a few comments: 1) You should be able to create the categories array without the need for the subsequent for loop as follows: StaticAssertStmt(GIN_CAT_NORM_KEY == 0, "Assuming GIN_CAT_NORM_KEY=0"); *categories = palloc0_array(GinNullCategory, (nentries + (hasNull ? 1 : 0)); 2) Your test case seems sub-optimal to show the gains. The arrays don't contain any NULL values and are also already sorted. Or I'm missing something. 3) Could you try your patch in conjunction with 0002 on data that is not pre-sorted and then check if 0002 has more impact? That way cmpEntries() should be called much more often. 
4) Have you checked if there are regression tests that exercise this code? If not, how about adding some? >>> Perhaps we should introduce a new qunique_eq() function with a different >>> callback signature: >>> >>> /* like qunique(), but the callback function returns true/false rather >>> than int */ >>> static inline size_t >>> qunique_eq(void *array, size_t elements, size_t width, >>> bool (*equal) (const void *, const void *)) >>> >> >> I would prefer to change qunique() instead. That would enforce using an >> adequate comparison function from the get go. There are only ~15 calls >> to qunique(). So refactoring this should also be a fairly small patch. I >> can do that if there's agreement for that approach. > > Works for me. > > At quick glance, most if not all of the qunique() calls call qsort() > just before qunique(). I wonder if we should have a single "sort and > deduplicate" function, instead. It could perhaps do some deduplication > "on the go", or other optimizations. If it's just for deduplication purposes and the data doesn't have to end up sorted, something based on a hash map should be even faster. How about we start with changing the qunique() comparator signature and as a 2nd step take a closer look at how to go about providing a function that does it in one go? If you agree, I'll share a patch here next week. >>> Did you measure how big is the impact from each individual patch? >>> Patches 1 and 2 seem pretty much ready to be committed, but I wonder if >>> they make any difference on their own. >> >> Here is the impact of each patch. I ran again CREATE INDEX three times >> and took the fastest run. The run of each patch includes all previous >> patches as well. For example, the timings for patch 0003 were measured >> with a binary that also had patch 0002 and 0001 applied. To get the >> impact of each patch in isolation, the delta to the previous run was >> taken. 
>>
>> Code                                | movies | delta | lineitem | delta
>> ------------------------------------|--------|-------|----------|--------
>> master                              | 10,311 |     0 |  256,986 |      0
>> v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 |  239,778 | 17,208
>> v1-0002-Optimized-comparison-func   |  9,510 |   184 |  238,094 |  1,684
>> v1-0003-Use-sort_template.h         |  8,661 |   849 |  231,190 |  6,904
>> v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 |  232,472 | -1,282
>> v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 |  228,387 |  4,085
>> v1-0006-Use-radix-sort              |  6,976 | 1,264 |  207,687 | 20,700
>> v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 |  203,744 |  3,943
>> v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 |  161,469 | 42,275
>>
>> Attached is v2 of the patch set with the aforementioned changes. I've
>> also fixed the whitespace errors in 0003, 0004 and 0008, as reported by
>> Kirill.
>
> Thanks, I pushed patch 0001 now, that's a simple and clear win.

Great. Thanks.

What about the other patches? 0003 and 0007 are also pretty simple and IMHO uncontroversial while giving decent savings. For 0006 I would make the code also work for char being unsigned. That's still missing.

Any thoughts about 0008 and my findings regarding the lower-case conversion for ASCII? Adding a flag to pg_locale_struct indicating whether it's safe to use ASCII-style tolower should be straightforward, and then the code should be correct.

--
David Geier
On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>>> of text is actually ASCII. Hence, we provide a fast path for this case
>>> which is exercised if the MSB of the current character is unset.
>>
>> This uses pg_ascii_tolower() for ASCII characters when built with
>> IGNORECASE. I don't think that's correct if the proper collation
>> would do something more complicated than what pg_ascii_tolower() does.
>
> Oh, that's evil. I had tested that specifically. But it only worked
> because the code in master uses str_tolower() with
> DEFAULT_COLLATION_OID. So using a different locale like in the following
> example does something different than when creating a database with the
> same locale.
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III' COLLATE "tr_TR");
> show_trgm
> -------------------------
> {" i"," ii","ii ",iii}
> (1 row)
>
> But when using tr_TR as default locale of the database the following
> happens:
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III');
> show_trgm
> ---------------------------------------
> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>
> I'm wondering if that's intentional to begin with. Shouldn't the code
> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
> research to see how other index types handle locales.
>
> Coming back to the original problem: the lengthy comment at the top of
> pg_locale_libc.c suggests that in some cases ASCII characters are
> handled the pg_ascii_tolower() way for the default locale. See for
> example tolower_libc_mb(). So a character by character conversion using
> that function will yield a different result than strlower_libc_mb(). I'm
> wondering why that is.
Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.
> Anyways, we could limit the optimization to only kick in when the used
> locale follows the same rules as pg_ascii_tolower(). We could test that
> when creating the locale and store that info in pg_locale_struct.
I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ascii character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.
Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for
string matching. It'd be a backwards-compatibility breaking change, though.
- Heikki
Hi Heikki!

On 09.01.2026 22:02, David Geier wrote:
> Given that doing the sort on pre-sorted input apparently doesn't add
> measurable overhead, according to my benchmark results, we can apply
> your patch and leave mine out for the time being.

I've removed my patch from the patch set in favor of yours. I've adapted your patch slightly as follows:

- I replaced the custom for-loop with qunique().
- I switched to sort_template.h, which gives a nice speedup as it can
  do some things more efficiently now that the entries are simple Datums.
- I use palloc0_array() to initialize the array to GIN_CAT_NORM_KEY in
  one go.

Your patch together with my changes gives a 20% speedup on a table with arrays of 1000 elements and 10% NULLs. See the attached test script.

> That's, by the way, also the reason why 0002 doesn't show much gain:
> when the data is pre-sorted, cmpEntries() is not called as much.

This turned out not to be the case. I tested 0002 with the attached script, but that didn't show any significant improvement either. That's still curious to me, because cmpEntries() is called hundreds of millions of times and the disassembly shows that the optimized function indeed calls the operator directly rather than going through the indirection of FunctionCall2Coll(). Anyway, I've removed the patch from the patch set for the time being.

>>> I would prefer to change qunique() instead. That would enforce using an
>>> adequate comparison function from the get-go. There are only ~15 calls
>>> to qunique(), so refactoring this should also be a fairly small patch. I
>>> can do that if there's agreement on that approach.
>>
>> Works for me.
>>
>> At a quick glance, most if not all of the qunique() calls call qsort()
>> just before qunique(). I wonder if we should have a single "sort and
>> deduplicate" function instead. It could perhaps do some deduplication
>> "on the go", or other optimizations.
>
> If it's just for deduplication purposes and the data doesn't have to
> end up sorted, something based on a hash map should be even faster.
> How about we start with changing the qunique() comparator signature
> and as a second step take a closer look at how to go about providing a
> function that does it in one go?

Thinking about this some more: ideally we'd have two functions, something like deduplicateArray() and sortAndDeduplicateArray(). We could initially implement deduplicateArray() on top of sortAndDeduplicateArray(). If we ever find a case that needs optimization and doesn't require the data to actually end up sorted, we can implement deduplicateArray() e.g. on top of simplehash.h. I'll draft a patch and submit it in a separate thread.

> What about the other patches? 0003 and 0007 are also pretty simple and
> IMHO uncontroversial while giving decent savings.

I've reordered the patches such that the ones I consider uncontroversial, small and ready to be committed come first (patches 0001 - 0004). The radix sort and ASCII fast-path patches come last (0005 and 0006). I would like to first concentrate on getting 0001 - 0004 in and then get back to 0005 and 0006.

I remeasured the savings of 0001 - 0004, which come on top of the already committed patch that inlined the comparison function (itself worth another ~5%):

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|-------------|--------
movies(plot)        |        8,058 |      10,311 | 1.27x
lineitem(l_comment) |      223,233 |     256,986 | 1.19x

--
David Geier
Attachments
- table_with_random_int_arrays.sql
- v3-0006-Add-ASCII-fastpath-to-generate_trgm_only.patch
- v3-0005-Optimize-generate_trgm-with-radix-sort.patch
- v3-0004-Faster-qunique-comparator-in-generate_trgm.patch
- v3-0003-Make-btint4cmp-branchless.patch
- v3-0002-Optimize-generate_trgm-with-sort_template.h.patch
- v3-0001-Optimize-sort-and-deduplication-in-ginExtractEntr.patch
>> Oh, that's evil. I had tested that specifically. But it only worked
>> because the code in master uses str_tolower() with
>> DEFAULT_COLLATION_OID. So using a different locale like in the following
>> example does something different than when creating a database with the
>> same locale.
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>> lower
>> -------
>> ııı
>>
>> postgres=# select show_trgm('III' COLLATE "tr_TR");
>> show_trgm
>> -------------------------
>> {" i"," ii","ii ",iii}
>> (1 row)
>>
>> But when using tr_TR as default locale of the database the following
>> happens:
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>> lower
>> -------
>> ııı
>>
>> postgres=# select show_trgm('III');
>> show_trgm
>> ---------------------------------------
>> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>>
>> I'm wondering if that's intentional to begin with. Shouldn't the code
>> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
>> research to see how other index types handle locales.
>>
>> Coming back to the original problem: the lengthy comment at the top of
>> pg_locale_libc.c suggests that in some cases ASCII characters are
>> handled the pg_ascii_tolower() way for the default locale. See for
>> example tolower_libc_mb(). So a character by character conversion using
>> that function will yield a different result than strlower_libc_mb(). I'm
>> wondering why that is.
>
> Hmm, yeah, that feels funny. The trigram code predates per-column
> collation support, so I guess we never really thought through how it
> should interact with COLLATE clauses.
I've written a patch to fix that. See [1].
>> Anyways, we could limit the optimization to only kick in when the used
>> locale follows the same rules as pg_ascii_tolower(). We could test that
>> when creating the locale and store that info in pg_locale_struct.
>
> I think that's only possible for libc locales, which operate one
> character at a time. In ICU locales, lower-casing a character can depend
> on the surrounding characters, so you cannot just test the conversion of
> every ascii character individually. It would make sense for libc locales
> though, and I hope the ICU functions are a little faster anyway.
>
> Although, we probably should be using case-folding rather than lower-
> casing with ICU locales anyway. Case-folding is designed for string
> matching. It'd be a backwards-compatibility breaking change, though.
Oh, I wasn't aware of that. Doing it only for libc locales still seems
useful.
Good point about case-folding. I'll look into that.
How do we usually go about such backwards-compatibility breaking
changes? Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?
[1]
https://www.postgresql.org/message-id/flat/db087c3e-230e-4119-8a03-8b5d74956bc2%40gmail.com
--
David Geier
On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
>
> How do we usually go about such backwards-compatibility breaking
> changes?

When it concerns a bug, we mention the change in the release notes with a warning to reindex affected indexes to be sure no known corruption remains. See e.g. the final entry in the PG18 release notes' migration section here: https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.

> Could we have pg_upgrade reindex all GIN indexes? Would that be
> acceptable?

No. We'd handle this like any other collation/opclass fix; we ask users to reindex their indexes in their own time after they've upgraded their cluster. Note that in this case it concerns an issue with just one GIN opclass, not all GIN indexes; so even if we were to address this in pg_upgrade, it wouldn't be correct to reindex every GIN index, as only a subset of those would be affected by this issue.

Generally speaking, pg_upgrade doesn't concern itself with the validity of the data structures described by the catalogs it upgrades; it only ensures that it correctly transcribes the catalogs from one version to another, and that the data files of the old cluster are transferred correctly without changes.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)
Hi Matthias,

On 21.01.2026 21:50, Matthias van de Meent wrote:
> On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
>>
>> How do we usually go about such backwards-compatibility breaking
>> changes?
>
> When it concerns a bug, we mention the change in the release notes
> with a warning to reindex affected indexes to be sure no known
> corruption remains. See e.g. the final entry in the PG18 release
> notes' migration section here:
> https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.
>
>> Could we have pg_upgrade reindex all GIN indexes? Would that be
>> acceptable?
>
> No. We'd handle this like any other collation/opclass fix; we ask
> users to reindex their indexes in their own time after they've
> upgraded their cluster. Note that in this case it concerns an issue
> with just one GIN opclass, not all GIN indexes; so even if we were to
> address this in pg_upgrade, it wouldn't be correct to reindex every
> GIN index, as only a subset of those would be affected by this issue.
>
> Generally speaking, pg_upgrade doesn't concern itself with the
> validity of the data structures described by the catalogs it upgrades;
> it only ensures that it correctly transcribes the catalogs from one
> version to another, and that the data files of the old cluster are
> transferred correctly without changes.

Thanks for the clarifications and the link to the release notes. That's very helpful. Then I know how to move on and will update the patch accordingly.

--
David Geier
On 23.01.2026 11:18, David Geier wrote:
> Hi Matthias,
>
> On 21.01.2026 21:50, Matthias van de Meent wrote:
>> On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
>>>
>>> How do we usually go about such backwards-compatibility breaking
>>> changes?
>>
>> When it concerns a bug, we mention the change in the release notes
>> with a warning to reindex affected indexes to be sure no known
>> corruption remains. See e.g. the final entry in the PG18 release
>> notes' migration section here:
>> https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.
>>
>>> Could we have pg_upgrade reindex all GIN indexes? Would that be
>>> acceptable?
>>
>> No. We'd handle this like any other collation/opclass fix; we ask
>> users to reindex their indexes in their own time after they've
>> upgraded their cluster. Note that in this case it concerns an issue
>> with just one GIN opclass, not all GIN indexes; so even if we were to
>> address this in pg_upgrade, it wouldn't be correct to reindex every
>> GIN index, as only a subset of those would be affected by this issue.
>>
>> Generally speaking, pg_upgrade doesn't concern itself with the
>> validity of the data structures described by the catalogs it upgrades;
>> it only ensures that it correctly transcribes the catalogs from one
>> version to another, and that the data files of the old cluster are
>> transferred correctly without changes.
>
> Thanks for the clarifications and the link to the release notes. That's
> very helpful. Then I know how to move on and will update the patch
> accordingly.

Attached are the patches rebased on latest master.

I've removed the ASCII fast-path patch 0006 as it turned out to be more complicated to make work than expected.

I kept the radix sort patch because it gives a decent speedup, but I would like to focus for now on getting patches 0001 - 0004 merged. They're all simple and, the way I see it, uncontroversial.
I remeasured the savings of 0001 - 0004, which come on top of the already committed patch that inlined the comparison function (itself worth another ~5%):

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|-------------|--------
movies(plot)        |        8,058 |      10,311 | 1.27x
lineitem(l_comment) |      223,233 |     256,986 | 1.19x

I've also registered the change at the commit fest, see https://commitfest.postgresql.org/patch/6418/.

--
David Geier
Attachments
> Attached are the patches rebased on latest master.
>
> I've removed the ASCII fast-path patch 0006 as it turned out to be more
> complicated to make work than expected.
>
> I kept the radix sort patch because it gives a decent speedup, but I
> would like to focus for now on getting patches 0001 - 0004 merged.
> They're all simple and, the way I see it, uncontroversial.
>
> I remeasured the savings of 0001 - 0004, which come on top of the
> already committed patch that inlined the comparison function (itself
> worth another ~5%):
>
> Data set            | Patched (ms) | Master (ms) | Speedup
> --------------------|--------------|-------------|--------
> movies(plot)        |        8,058 |      10,311 | 1.27x
> lineitem(l_comment) |      223,233 |     256,986 | 1.19x
>
> I've also registered the change at the commit fest, see
> https://commitfest.postgresql.org/patch/6418/.

Attached is v5, which removes an incorrect assertion from the radix sort code.

--
David Geier